Preface



Resolution of lexical ambiguity, commonly termed "word sense disambiguation", is expected to
improve the analytical accuracy of tasks which are sensitive to lexical semantics. Such tasks
include machine translation, information retrieval, parsing, natural language understanding and
lexicography. Reflecting the growth in utilization of machine readable texts, word sense
disambiguation techniques have been explored variously in the context of corpus-based approaches.
Within one corpus-based framework, namely the similarity-based method, systems use a database
in which example sentences are manually annotated with correct word senses. Given an input,
systems search the database for the example most similar to the input. The lexical ambiguity
of a word contained in the input is resolved by selecting the sense annotation of the retrieved
example.

In this research, we apply this method to the resolution of verbal polysemy, in which the
similarity between two examples is computed as the weighted average of the similarity between
complements governed by a target polysemous verb. We explore similarity-based verb sense
disambiguation focusing on the following three issues. First, we propose a weighting schema
for each verb complement in the similarity computation. Second, in similarity-based techniques,
the overhead for manual supervision and for searching a large database can be prohibitive.
To resolve this problem, we propose a method to select a small number of effective examples for
system usage. Finally, the efficiency of our system is highly dependent on the similarity
computation used. To maximize efficiency, we propose a method which integrates the advantages of
previous methods for similarity computation.

Chapter 1

Introduction 



1.1 Background 

Natural language processing (NLP) involves resolution of various types of ambiguity. Lexical 
ambiguity is one of these ambiguity types, and occurs when a single word (lexical form) is 
associated with multiple senses or meanings. For applications which are sensitive to semantic 
denotation, or more precisely lexical semantics, this ambiguity type can pose a major obstacle. 
Resolution of lexical ambiguity, which is commonly termed "word sense disambiguation" (WSD), 
is expected to improve the quality of the following research fields. 

• Machine translation (MT) can safely be identified as one of the major beneficiaries of word
sense disambiguation, because a single word in a source language is frequently associated
with multiple translations in a target language, each of which is often associated with a
different sense of the source word. For example, the "tax" and "obligation" senses of the
English word duty correspond to the French translations droit and devoir, respectively.
Sense disambiguation of duty is expected to allow this word to be translated appropriately
for a given context. In fact, a number of MT-oriented word sense disambiguation methods
have been explored based on this notion [10, 25



• Information retrieval (IR) and text categorization (TC) suffer from the effects of noisy
words associated with multiple senses, and IR/TC systems can easily end up relating
documents containing the same words but in different senses (usages). For example,
documents containing the word AIDS can easily be associated with those containing the word
aids. Conventional systems have tentatively avoided this problem through the use of
inverse document frequency (IDF) [127]. The rationale behind IDF is that words which
rarely occur over document collections are valuable, or in other words, the IDF of
a word is inversely proportional to the number of documents containing that word. As
Krovetz |^4| identified, this can be recast in the context of lexical ambiguity. That is,
words of higher frequency also tend to be associated with a greater number of senses, and
therefore the degree to which these noisy words affect the system output can be minimized
by introducing IDF. However, as a number of experimental results have shown or suggested,
word sense disambiguation is a crucial task for the further improvement of IR/TC
systems |g, |7|, ^, |T2|, |T2|, pi] ]



• Syntactic analysis (or parsing) often fails to identify the correct syntactic structure for
an input sentence when syntactic relations are associated with semantic content.
Prepositional phrase (PP) attachment problems^ and predicate-argument structures associated
with selectional restrictions are immediate examples of this problem type, in that they
require the intervention of the semantic content of lexical entries for knowledge
representation. Given the fact that syntactic and semantic (lexical) ambiguity are not independent
of each other, a number of methods have been proposed for the mutual resolution of these two
ambiguity types 1 84,



• So-called class-based NLP approaches [66, 120] are also potential beneficiaries of word sense
disambiguation^. These approaches involve mapping each word entry to a semantic class
(usually taken from a thesaurus taxonomy). Consequently, disambiguation of word senses
is pertinent, because each word is often associated with multiple class candidates, which
are closely related to the senses of that word.

• In natural language understanding (NLU), semantic structures are constructed by
considering the meaning of each word. Kilgarriff [^] argued that current practical NLU
systems, such as dialogue and information extraction (IE) systems, have commonly
employed domain-specific knowledge representation rather than word sense disambiguation,
in order to counter lexical ambiguity. However, we would like to note that his observation
does not rule out a potential contribution of word sense disambiguation to NLU systems.



• Kilgarriff [68] also points to the advantages of word sense disambiguation in lexicography.
That is, sense-annotated linguistic data reduces the considerable overhead
imposed on lexicographers in sorting large-scale corpora according to word usage for
different senses. Not only lexicography, but the general process of compiling linguistic
resources is expected to improve through the interaction between lexicographers and
computers [5, 146]. In addition, word sense disambiguation techniques can also allow language
learners to access example sentences containing a certain word usage from large corpora,
without excessive overhead.

^Ravin [118], for example, focused on the resolution of the PP-attachment problem through word sense
disambiguation.

^A number of methods described in the previous item ("parsing") can also be seen as instances of this category.

It should be noted that past word sense disambiguation methods have not contributed to all
the research fields listed above. In fact, no quantitative data has been forthcoming documenting
empirical improvement through word sense disambiguation, except in the fields of MT
and IR/TC |^0[. This past failure to enhance existing methods is presumably due
to the immaturity of word sense disambiguation research, which stimulates us to further explore
this exciting research area.

Reflecting the growth in utilization of machine readable texts, word sense disambiguation
techniques have been explored variously in the context of "corpus-based NLP approaches"^.
These methods generally use a corpus in which component words of each example sentence are
annotated (either manually or automatically) with their correct word sense, to automatically
induce rules or probabilistic models for disambiguation. Unlike conventional rule-based
approaches relying on hand-crafted selectional rules (some of which are reviewed, for example, by
Hirst [^] and Small et al. [135]), corpus-based approaches release us from the task of generalizing
observed phenomena through a set of rules. While certain methods require manual annotation
of the given corpora (namely "supervised methods")^, other methods exclude or minimize the
overhead for supervision (namely "unsupervised methods"). However, we observe that the
applicability of unsupervised methods has so far been limited to relatively specific applications,
and that the overhead for supervision still remains a major drawback of corpus-based word
sense disambiguation (we will elaborate on this issue in Chapter 2).

1.2 Focus of this Research 

First, let us precisely state the focus of this research, i.e. which subcategory of word sense
disambiguation we are targeting, given the considerable variation in types and associated methods of
lexical disambiguation (between noun and adjective senses, for example). At the same time, we
consider the difficulty of the task, which can range between coarse-grained (totally distinct) and
fine-grained (closely related) word sense distinctions. In this research, we explore
disambiguation of verb senses (verbal polysemy^) based on an existing machine readable dictionary. Note
that the dictionary we use provides relatively fine-grained verbal polysemy. Among the research
fields described in Section 1.1, our research focus is expected to improve the quality of machine
translation, parsing, natural language understanding and lexicography. With regard to current
information retrieval systems, the disambiguation of noun senses seems to be a more crucial
task than that of verb senses (for one thing, keywords and user queries are usually composed of
noun phrases). However, the potential contribution of verb sense disambiguation extends to a
significant proportion of the tremendous range of information retrieval systems.

^Corpus-based approaches have been explored in other types of NLP research, a sample of which is
reviewed, for example, by Church and Mercer [^].

^One may argue that supervised corpus-based methods have not released us from hand-encoding tasks.
However, we would like to note that manual annotation is still easier than describing rule sets based on human
introspection.

Second, let us describe the approach we use to tackle the verb sense disambiguation task.
Obviously, our research methodology must be contextualized in terms of past research literature
associated with corpus-based word sense disambiguation, for which the reader should refer to
Chapter 2. In this process, it is important not to limit our focus only to verb sense
disambiguation, because methods employed in other types of disambiguation may also be applicable.
In brief, our system employs a similarity-based method, in which disambiguation is performed
based on the similarity between a given input sentence and example sentences associated with
each verb sense. The similarity is computed by averaging the similarity for each case (or case
filler noun) syntactically governed by the target verb. Through preliminary experimentation on
Japanese verbs, we identified a number of problems associated with our research focus. In the
following, we describe our approaches to these problems.

• The degree to which each case contributes to verb sense disambiguation is not consistent.
One may intuitively understand that in the case of English, for example, the object case is more
closely related to verb senses than the subject case. The same observation can also be made
in the case of Japanese. We explore a method of introducing this notion computationally
into similarity-based methods, and demonstrate the effectiveness of our proposal through
comparison with a number of different methods (see Chapter 3 for details).

• As with most corpus-based word sense disambiguation systems, our system uses a
large-scale corpus annotated with correct verb senses. However, a considerable overhead is
required when one tries to manually perform the annotation process. One possible solution
would involve automatic annotation, that is, an unsupervised method. However, our
experiments show that, at least for the particular unsupervised method targeted in our
research, unsupervised methods still find it difficult to match the performance achieved by
supervised methods. In view of this result, we explore a semi-supervised method in which
we selectively sample a small number of effective example sentences from a given corpus
in the annotation process. In other words, our sampling method aims at minimizing the
overhead for manual supervision, without degrading the system performance (see Chapter 4
for details).

• We also identified that the performance of our system is highly dependent on the similarity
computation between the input and example sentences. Roughly speaking, past approaches
for similarity computation can be subdivided into statistically-driven and thesaurus-driven
methods. In addition, integrated methods combining these two approaches have recently
been proposed. We also explore the similarity computation in the context of the integration
of different methods, and demonstrate the effectiveness of our proposal for our verb sense
disambiguation system (see Chapter 5 for details).

^We will define "polysemy" in Section 2.1.
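The similarity-based approach outlined in this section can be illustrated with a minimal sketch. It assumes a tiny hand-made example database, a toy noun similarity based on hypothetical semantic classes, and hypothetical per-case weights standing in for the degree of case contribution; none of these names or values are taken from the system described in this dissertation.

```python
# Hypothetical noun classes used only by the toy similarity below.
NOUN_CLASS = {
    "employee": "HUMAN", "engineer": "HUMAN",
    "proposal": "IDEA", "plan": "IDEA",
    "company": "ORGANIZATION", "committee": "ORGANIZATION",
}

def noun_similarity(a, b):
    """Toy similarity: 1.0 for identical nouns, 0.5 for a shared class, else 0.0."""
    if a == b:
        return 1.0
    class_a, class_b = NOUN_CLASS.get(a), NOUN_CLASS.get(b)
    return 0.5 if class_a is not None and class_a == class_b else 0.0

def example_similarity(input_cases, example_cases, weights):
    """Weighted average of case filler similarities over the cases both sides share."""
    shared = [case for case in input_cases if case in example_cases]
    total_weight = sum(weights.get(case, 1.0) for case in shared)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        weights.get(case, 1.0) * noun_similarity(input_cases[case], example_cases[case])
        for case in shared
    )
    return weighted_sum / total_weight

def disambiguate(input_cases, examples, weights):
    """Return the sense annotation of the most similar example."""
    sense, _ = max(examples, key=lambda ex: example_similarity(input_cases, ex[1], weights))
    return sense

# Hypothetical sense-annotated examples: (verb sense, case fillers).
EXAMPLES = [
    ("to hire", {"subject": "company", "object": "employee"}),
    ("to accept", {"subject": "committee", "object": "proposal"}),
]
# Hypothetical weights: the object case is assumed to contribute more.
WEIGHTS = {"subject": 1.0, "object": 2.0}

print(disambiguate({"subject": "committee", "object": "engineer"}, EXAMPLES, WEIGHTS))
```

With the object case weighted more heavily, the input above is matched to the "to hire" example even though its subject is identical to that of the "to accept" example; with uniform weights the two examples would tie.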

1.3 Outline of the Proceeding Chapters 

Chapter 2 surveys past research on word sense disambiguation, and identifies open questions in
the field. Chapter 3 describes the overall architecture of our similarity-based verb sense
disambiguation system, in which we newly introduce the notion of degree of case contribution to verb
sense. Chapter 4 proposes a selective sampling method, which samples a smaller, effective
example set for use with our verb sense disambiguation system, so as to minimize the overhead
for supervision and for searching a large corpus. Chapter 5 explores and discusses the
similarity computation used in our system, through integration of thesaurus taxonomy and
co-occurrence statistics. Finally, Chapter 6 summarizes our contributions and discusses outstanding
issues.

Chapter 2



Past Research on Word Sense 
Disambiguation 

Reflecting the rapid growth in utilization of machine readable texts, word sense
disambiguation techniques have been explored variously in the context of "corpus-based
NLP approaches". This chapter surveys past research associated with corpus-based
word sense disambiguation, focusing mainly on methodology and evaluation criteria.

2.1 Overview: Terminology and Task Description 

The task of a word sense disambiguation system ("system" or "WSD system", hereafter) is to
resolve the lexical ambiguity of a word^ in a given context. To put it more precisely, the term
"lexical ambiguity" refers to two different concepts: "homonymy" and "polysemy". The former
is the case where two different words happen to have the same lexical form, and the latter is the
case where one word has several (related) meanings. Conventionally, the distinction between
bank ("river edge") and bank ("financial institution") has been used as an example of homonymy,
and rust (verb) and rust (noun) for polysemy. In this dissertation, we will generally use the
term "polysemy" to refer to both lexical ambiguity types, because (a) the difference between
these two ambiguity types has been relatively less controversial in word sense disambiguation
tasks (although from a linguistic point of view, the two ambiguity types should be rigorously
defined), and (b) the focus of this research is on disambiguating verbal polysemy rather than
homonymy. To derive plausible word senses (polysemy), most past WSD systems have used
lexical resources, such as machine readable dictionaries (MRDs) or thesauri. Thus, the task
of WSD systems can be termed "categorization", because a plausible word sense is selected
from predefined candidates. Note that this task should not be confused with automatic
identification of word senses [41, 144, 155] and word clustering (grouping) p9| , p^ , |113| , |13S[| .
One may notice that the distinction between "ambiguity" and "vagueness" is also a controversial
linguistic issue [44, 140]. A comprehensible example of vagueness would be aunt, which is
unspecific between "father's sister" and "mother's sister" without explicit context. However, this
type of under-specification is beyond the scope of word sense disambiguation unless dictionaries
list father/mother's sister as different senses of aunt.

^In most cases, past methods disambiguate only "content words", and "function words" such as prepositions
are beyond the scope of word sense disambiguation.

Let us leave the definition of lexical ambiguity and turn to the process of word sense
disambiguation. Given an input sentence containing polysemous word(s), most WSD systems first
preprocess the input to extract a set of features (clues) used for the disambiguation. This
preprocessing typically involves morphological/syntactic analysis, because the parts-of-speech of words
appearing in the input, and syntactic relations involving polysemous words, can be informative
features. McRoy [90], for example, identified syntactic tags, morphology, collocations, and
word associations as the most important sources of information for word sense disambiguation.
Thereafter the system interprets polysemous word(s) by selecting a single plausible word sense.
While certain systems interpret only one polysemous word in the input, other systems
simultaneously interpret all polysemous words appearing in the input. Wilks and Stevenson [152] call
this second task type "word sense tagging" (based on the analogy of part-of-speech tagging).
However, the difference between the two task types is relatively unimportant for this research.

Section 2.2 classifies past word sense disambiguation methods. Section 2.3 then describes the
way past research has evaluated word sense disambiguation methods. Section 2.4 describes
related NLP research which is expected to enhance word sense disambiguation. Finally, discussion
associated with word sense disambiguation is added in Section 2.5.



2.2 Different Methodologies for Word Sense Disambiguation 

This section surveys different past methods for corpus-based word sense disambiguation, to
clarify problems tackled in this research. First, Section 2.2.1 classifies past methods according
to their induction mechanism (rule-based method or probabilistic model, say). Second,
Section 2.2.2 focuses on a different viewpoint, that is, supervised vs. unsupervised learning
methods. While supervised methods require manual annotation of the correct sense for each polysemous
word contained in a corpus, unsupervised methods automatically acquire corpora annotated
with (presumably correct) word senses. These automatic methods are what Section 2.2.2
focuses on principally. It should be noted that criteria for the classification of past methods can
vary depending on the viewpoint or interest of the researcher^.



2.2.1 General classification of methodologies 



This section classifies past methodologies as shown in Figure 2.1, which first divides methods 
into two different approaches. The first approach can be called "qualitative approach"^ and 
uses selectional rules associated with each word sense candidate. Given an input containing 
a polysemous word, the system deterministically selects the sense(s) for which the rules are 
satisfied. Generally speaking, the granularity of selectional rules can be a problem for this 
approach. That is, specified rules often fail to select a word sense for exceptional inputs. On 
the other hand, generalized rules run the risk of selecting incorrect word senses. To counter 
this problem, the second approach - what we call the "quantitative" approach - computes 
scalable values for each word sense candidate, and selects the sense with maximal value as the 
interpretation of the polysemous word. Compared with the rule-based approach, this approach is 
more robust for exceptional inputs^. Both approaches generally use a corpus, in which examples 
are associated with each word sense candidate. We shall call this corpus the "training data". 
Note that in this section, examples in the training data can be annotated either manually or 
automatically, without loss of generality. The qualitative approach can be further subdivided 
in terms of rule format. Subcategories described in this section are "selectional restrictions", 
"decision trees" and "decision lists". The quantitative approach can be divided in terms of 
the method used to compute scalable values, as "probabilistic models" and "similarity-based
methods". All subcategories of these two approaches correspond to the different titles of the
following items.



qualitative approach
    selectional restrictions
    decision trees
    decision lists

quantitative approach
    probabilistic models
    similarity-based methods

Figure 2.1: Classification of different methodologies of word sense disambiguation



^For example, Wilks and Stevenson [152] classified past word sense disambiguation methods differently.

^One may notice that the term "symbolic/rule-based approach" can be used interchangeably with
"qualitative approach". However, we used the term "qualitative approach" to make the direct contrast
to "quantitative approach".

^A number of compromised methods can be found in past research. Wilks [150] used selectional restrictions
as preferences rather than constraints. Uramoto [142] combined qualitative and quantitative approaches.

Selectional restrictions Selectional restrictions [^], which impose constraints on arguments
for a given word (sense), have commonly been used in word sense disambiguation relying on
hand-crafted rules |l^, |5j. Let us take the following example sentences (1-a) and (1-b), which
contain different usages of the verb employ, that is, "to hire" and "to accept":

(1) a. The facility will employ new employees. ("to hire")
    b. The committee employed his proposal. ("to accept")
One can intuitively differentiate the senses of employ in sentences (1-a) and (1-b) by means of the
complements of each employ. To be more precise, employ in (1-a) restricts its subject and object
nouns to those associated with the semantic features HUMAN/ORGANIZATION and HUMAN,
respectively. On the other hand, employ in (1-b) restricts its subject and object nouns to those
associated with the semantic features HUMAN/ORGANIZATION and IDEA, respectively.
Consequently, given employees as the object, the sense "to hire" is selected as the interpretation of
employ in (1-a), and the sense "to accept" is ruled out. The same reasoning can be used to select
the sense "to accept" as the interpretation of employ in (1-b). One may notice that selectional
restrictions can also disambiguate the polysemy of verb complements (the subject and object). For
example, facility in (1-a) has multiple senses, a sample of which are "installation", "proficiency"
and "readiness". However, the selectional restriction imposed on the subject of employ ("to
hire") can correctly select the sense "installation" as the interpretation of facility. It should
be noted that the polysemy of both facility and employ is theoretically disambiguated
simultaneously. In fact, the disambiguation process can be seen as mutually propagating semantic
constraints to each polysemous word through selectional restriction^.
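The reasoning applied to sentences (1-a) and (1-b) can be sketched as a small filter over sense combinations. The restriction table follows the semantic features named in the text; the noun sense inventory and the feature assigned to each noun sense (for example, ORGANIZATION for the "installation" sense of facility) are assumptions made for illustration only.

```python
# Selectional restrictions for the two senses of "employ", as described in the text.
RESTRICTIONS = {
    "to hire": {"subject": {"HUMAN", "ORGANIZATION"}, "object": {"HUMAN"}},
    "to accept": {"subject": {"HUMAN", "ORGANIZATION"}, "object": {"IDEA"}},
}

# Hypothetical noun sense inventory: noun -> {sense: semantic feature}.
NOUN_SENSES = {
    "facility": {"installation": "ORGANIZATION",
                 "proficiency": "ABILITY",
                 "readiness": "ABILITY"},
    "committee": {"committee": "ORGANIZATION"},
    "employees": {"employee": "HUMAN"},
    "proposal": {"proposal": "IDEA"},
}

def interpretations(subject, obj):
    """Return every (verb sense, subject sense, object sense) satisfying the restrictions."""
    results = []
    for verb_sense, restriction in RESTRICTIONS.items():
        for subject_sense, subject_feature in NOUN_SENSES[subject].items():
            if subject_feature not in restriction["subject"]:
                continue
            for object_sense, object_feature in NOUN_SENSES[obj].items():
                if object_feature in restriction["object"]:
                    results.append((verb_sense, subject_sense, object_sense))
    return results

print(interpretations("facility", "employees"))
print(interpretations("committee", "proposal"))
```

Because the verb and its complements are checked jointly, a single pass resolves the sense of employ and the sense of facility at the same time, which mirrors the mutual propagation of constraints noted above.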

However, considerable human effort is required to describe large-scale selectional
restrictions. Manual construction is also associated with human bias and inconsistency of granularity.
Besides this, revision requires additional human overhead, that is, lexicographers have to
laboriously identify and revise associated entries. Resnik [120] proposed an information-theoretic
method to automatically identify selectional restrictions, which is expected to counter this
problem to some degree. Resnik identified selectional restrictions as semantic classes defined in the
taxonomy of the well-known English semantic network WordNet |9^|^, and only those nouns
dominated in the taxonomy by the identified class can satisfy the restriction. While Resnik
used this method to resolve syntactic ambiguity, Ribas [124] applied it to disambiguate senses of
case filler nouns (for example, disambiguation of the polysemy of facility in (1-a) as performed
above). The basis of this method is to estimate the information-theoretic association degree
between case fillers and semantic classes, for each verb (or verb sense). Intuitively speaking, the
association degree gives a greater value for semantic classes that are likely to appear as a
complement of a given verb sense. Formally speaking, the association degree between verb sense s and
class r (restriction candidate) with respect to case c is computed by Equation 2.1 [120, 124].

$$A(s,c,r) = P(r \mid s,c) \cdot \log \frac{P(r \mid s,c)}{P(r \mid c)} \tag{2.1}$$

Here, P(r|s,c) is the conditional probability that a case filler marked with case c of sense s is
dominated by class r in the WordNet taxonomy. P(r|c) is the conditional probability that a
case filler marked with case c (disregarding verb sense) is dominated by class r. The distribution
obtained from training data is usually used to estimate each probability.

^Lytinen |8^ ] and Nagao used a constraint propagation method to simultaneously resolve syntactic and
semantic ambiguity.

^WordNet refers to semantic classes as "synsets".
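As a rough illustration of Equation 2.1, the following sketch estimates both conditional probabilities by relative frequency from a list of observations. It simplifies matters by assuming that each case filler has already been mapped to exactly one class (in a real taxonomy a filler is dominated by several classes), and it returns 0 when a class never co-occurs with the sense; the observation data and function name are hypothetical.

```python
import math

def association_degree(observations, sense, case, cls):
    """Estimate A(s, c, r) of Equation 2.1 from (sense, case, class) observations."""
    n_sc = sum(1 for s, c, _ in observations if s == sense and c == case)
    n_c = sum(1 for _, c, _ in observations if c == case)
    n_rsc = sum(1 for s, c, r in observations if s == sense and c == case and r == cls)
    n_rc = sum(1 for _, c, r in observations if c == case and r == cls)
    if n_rsc == 0:
        # Assumed convention: a class never seen with this sense contributes nothing.
        return 0.0
    p_r_given_sc = n_rsc / n_sc   # P(r | s, c)
    p_r_given_c = n_rc / n_c      # P(r | c)
    return p_r_given_sc * math.log(p_r_given_sc / p_r_given_c)

# Hypothetical training observations for the object case of "employ".
OBSERVATIONS = (
    [("to hire", "object", "HUMAN")] * 3
    + [("to hire", "object", "IDEA")] * 1
    + [("to accept", "object", "IDEA")] * 3
    + [("to accept", "object", "HUMAN")] * 1
)

print(association_degree(OBSERVATIONS, "to hire", "object", "HUMAN"))
print(association_degree(OBSERVATIONS, "to hire", "object", "IDEA"))
```

In this toy data the class HUMAN receives a positive association degree with the sense "to hire", while IDEA receives a negative one, since it is seen less often with that sense than with the object case in general.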

Decision trees In spite of their long history of application in AI research, "decision trees"
have rarely been applied to word sense disambiguation^. Among a number of proposed decision
tree algorithms, "C4.5" [117] has been used relatively commonly as a benchmark comparison.
Mooney [^] and Pedersen and Bruce [111] (individually) compared the performance of various
word sense disambiguation methods with the C4.5 algorithm. Tanaka [137] used the C4.5
algorithm to acquire English-Japanese verbal translation rules. One may notice that this task can
also be seen as a type of rule induction for word sense disambiguation, because a unique English
verb can be interpreted as different Japanese verbs. Figure 2.2 shows a fragment of the decision
tree for the English verb take, which corresponds to multiple Japanese translations such as
erabu ("to choose"), tsureteiku ("to take along") and motteiku ("to take away"). Given an input
containing take followed by the object noun him and preposition to, the decision tree selects
tsureteiku as the interpretation, by traversing the branches corresponding to "object=him" and
"prep=to".

take
  +-- object=him
  |     +-- prep=NIL : erabu ("to choose")
  |     +-- prep=to  : tsureteiku ("to take along")
  +-- object=box : motteiku ("to take away")

Figure 2.2: A fragment of the decision tree for the English verb take

^Okumura and Tanaka |10E| ] proposed generalized discrimination networks (GDNs), which can be seen as a variant
of decision trees, for word sense disambiguation. However, we will not discuss their method further here, because
they focus on exploration of an "incremental disambiguation model" and not automatic construction of the
networks, the main topic of this research.

Here, let us devote a little space to explaining the basis of the C4.5 algorithm. In this decision
tree algorithm, classification rules are formulated by recursively partitioning the training data.
Each nested partition is based on the feature value that provides the greatest increase in the
information gain ratio for the current partition. The final partitions correspond to a set of
classification rules where the antecedent of each rule is a conjunction of the feature values used
to form the corresponding partition. Let {C_1, C_2, ..., C_k} denote class candidates, one of which
is assigned to the input. Suppose we already have a partition, which divides the set T of training
data into subsets {T_1, T_2, ..., T_n}. Given that freq(C_j, T) and |T| denote the number of training
data in T that belong to class C_j and the number of training data in set T, respectively, the
"entropy" (uncertainty) of set T can be estimated by Equation 2.2.

$$\mathit{info}(T) = -\sum_{j} \frac{\mathit{freq}(C_j, T)}{|T|} \cdot \log \frac{\mathit{freq}(C_j, T)}{|T|} \tag{2.2}$$

Suppose T is partitioned into n subsets using feature value X as a classification rule. The expected
entropy over partitioned subsets, info_X(T), is then estimated by Equation 2.3.

$$\mathit{info}_X(T) = \sum_{i} \frac{|T_i|}{|T|} \cdot \mathit{info}(T_i) \tag{2.3}$$

The decrease in entropy resulting from the partition, which represents the information gain for
X, is then estimated by Equation 2.4.

$$\mathit{gain}(X) = \mathit{info}(T) - \mathit{info}_X(T) \tag{2.4}$$
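Equations 2.2 to 2.4 can be traced on a handful of training examples. The sketch below assumes base-2 logarithms (the text does not state the base) and a toy training set for take that is constructed to be consistent with Figure 2.2, not taken from any real data. It computes plain information gain as in Equation 2.4; the gain ratio normalization used by C4.5 is not shown.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels (Equation 2.2, base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_x(rows, feature):
    """Expected entropy after partitioning rows by one feature (Equation 2.3)."""
    subsets = {}
    for features, label in rows:
        subsets.setdefault(features[feature], []).append(label)
    total = len(rows)
    return sum(len(subset) / total * info(subset) for subset in subsets.values())

def gain(rows, feature):
    """Information gain of partitioning rows by one feature (Equation 2.4)."""
    return info([label for _, label in rows]) - info_x(rows, feature)

# Hypothetical training data: (feature values, Japanese translation of "take").
ROWS = [
    ({"object": "him", "prep": "NIL"}, "erabu"),
    ({"object": "him", "prep": "to"}, "tsureteiku"),
    ({"object": "box", "prep": "NIL"}, "motteiku"),
    ({"object": "box", "prep": "to"}, "motteiku"),
]

for feature in ("object", "prep"):
    print(feature, gain(ROWS, feature))
```

On this toy data the object feature yields the larger gain, so it would be chosen for the first partition, matching the root of the tree fragment in Figure 2.2.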



Decision lists Decision lists are a form of rule representation as proposed by Rivest [126],
and consist of tuples of the form "(condition, value)". As Rivest observes, decision lists can be
seen as "if-then-else" rules, in other words, exceptional conditions appear earlier while general
conditions appear later in the list^. Given a query, each condition in the decision list is applied
sequentially until a condition which is satisfied by the query is found. Thereupon, the value
which corresponds to that condition is selected as the answer. Yarowsky applied decision lists
to the task of accent restoration [156] (this is one type of lexical disambiguation, in which a
single word is associated with multiple pronunciations), and word sense disambiguation [157].
In Yarowsky's case [156, 157], each condition corresponds to a word collocation which can be
used as evidence to resolve lexical ambiguity, and each value corresponds to a correct word
sense (or pronunciation). Since manual identification of effective conditions is expensive and
inconsistent, Yarowsky used word collocation (within a fixed window size) obtained from a large
corpus to automatically identify effective evidence types. The degree of effectiveness of a given piece
of evidence is estimated as the likelihood that it supports a given sense candidate more strongly
than another^. Formally speaking, this notion is represented as the log-likelihood ratio, that is, the
logarithm of the ratio between the conditional probabilities that sense s_1 and sense s_2 occur,
respectively, given evidence E (Equation 2.5).

$$\log \frac{P(s_1 \mid E)}{P(s_2 \mid E)} \tag{2.5}$$

In decision lists, evidence (along with their supporting word senses) is sorted according to
log-likelihood, in descending order. Figure 2.3 shows a fragment of the decision list trained for the
disambiguation of the word plant ("organism"/"factory") [157], where each piece of evidence
denotes a specific collocational pattern or a word collocating within a certain distance of
plant. For example, given the input containing the pattern plant height, the interpretation
for plant is "organism". Note that Yarowsky [156, 157] used a binary sense distinction
method, in that the number of sense candidates was limited to two. To apply this method
to the disambiguation of words with multiple ambiguity, the denominator in Equation 2.5 is
presumably computed based on the probability of (a) all the other sense candidates or (b) only
the most competitive sense candidate (that is, the sense candidate with the second highest
probability given evidence E).

^The last condition accepts all cases (namely "true"), otherwise the system could potentially fail to make any
decision for certain input types.



evidence                       sense
plant growth                   organism
car (within ±k words)          factory
plant height                   organism
union (within ±k words)        factory
equipment (within ±k words)    factory



Figure 2.3: A fragment of the decision list for the word plant 
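The construction and use of such a decision list can be sketched as follows. This is only a minimal illustration under stated assumptions, not Yarowsky's implementation: the training counts are invented toy data, the add-alpha smoothing (`alpha`) is our own choice to avoid division by zero, and rules are ranked by the absolute value of the log-likelihood ratio so that each rule is ordered by how strongly it favors the sense it supports.

```python
import math
from collections import defaultdict


def train_decision_list(examples, senses, alpha=0.1):
    """Build a decision list from (evidence_set, sense) pairs for two senses."""
    counts = defaultdict(lambda: {senses[0]: 0.0, senses[1]: 0.0})
    for evidence_set, sense in examples:
        for evidence in evidence_set:
            counts[evidence][sense] += 1.0
    rules = []
    for evidence, c in counts.items():
        # log P(s1|E) / P(s2|E), estimated from smoothed counts (assumption).
        llr = math.log((c[senses[0]] + alpha) / (c[senses[1]] + alpha))
        supported = senses[0] if llr >= 0 else senses[1]
        rules.append((abs(llr), evidence, supported))
    # Strongest evidence first.
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return rules


def classify(rules, evidence_set, default):
    """Return the sense of the first (strongest) rule matched by the input."""
    for _score, evidence, sense in rules:
        if evidence in evidence_set:
            return sense
    # Plays the role of the final "true" condition that accepts all cases.
    return default


# Hypothetical toy training data for the word "plant".
TOY = (
    [({"plant growth"}, "organism")] * 3
    + [({"car"}, "factory")] * 2
    + [({"plant height"}, "organism")]
)
rules = train_decision_list(TOY, ("organism", "factory"))
print([(evidence, sense) for _score, evidence, sense in rules])
print(classify(rules, {"plant height"}, default="factory"))
```

Because only the first matching rule is applied, an input containing both car and plant height is decided by whichever rule ranks higher in the list, which is why the ordering of rules matters.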



Pedersen and Bruce [111] used the "CN2" rule induction algorithm [23] as one point of
comparison for their "decomposable model"^10. The CN2 algorithm generates decision lists based
on a given set of training data. The rule induction algorithm consists of finding rules, and



^9 In Yarowsky's case [156, 157], the number of sense candidates for each word was limited to two.
^10 See the following "Probabilistic models" item for details of the decomposable model.

measuring the significance of those rules based on entropy and their coverage, over the range
of the training data. The quality of decision lists is highly dependent on the sequence of rules,
and thus naive rule sequences end up degrading the disambiguation performance. To overcome
this problem, the latest version of CN2 [22] optionally employs an "unordered rule" technique.
Let us explain, in passing, the basic algorithm here. The training procedure is the same as in
the original version. However, given an input and decision rule set, rather than applying the
rules in a fixed order, the algorithm collects all the rules which the input satisfies, and their
associated conditions. Note that by disregarding the rule order (in other words the "context"),
different rules can potentially support different values and consequently ambiguity often remains
unresolved. In such a case, additional consideration is given to the distribution of the training
data covered by the collected rules, in choosing a unique value. Let us take Figure 2.4 as an
example rule set, where each line corresponds to a rule for the two values "bird" and
"elephant". In this figure, the "coverage" column denotes the number of training examples
which satisfy the corresponding rule, for the values "bird" and "elephant", respectively. Let us
consider the following example input.



<size=large, beaked=yes, legs=two, feather=yes, flies=no> 



The rule set in Figure 2.4 cannot uniquely decide the answer for this input, which satisfies all
three rules. However, the summed coverage of these rules, (35,10), puts the preference
for the final answer on "bird" over "elephant".



rule                                                  coverage
if legs=two and feather=yes then class=bird           (13,0)
if size=large and flies=no then class=elephant        (2,10)
if beaked=yes then class=bird                         (20,0)



Figure 2.4: An example rule set for CN2 
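The unordered application of the rules in Figure 2.4 can be sketched as follows. This is a minimal illustration of the coverage-summing step only, not the CN2 implementation itself; the function names and the dictionary encoding of rules and inputs are our own assumptions.

```python
CLASSES = ("bird", "elephant")

# (conditions, predicted class, coverage per class) as listed in Figure 2.4.
RULES = [
    ({"legs": "two", "feather": "yes"}, "bird", (13, 0)),
    ({"size": "large", "flies": "no"}, "elephant", (2, 10)),
    ({"beaked": "yes"}, "bird", (20, 0)),
]


def matches(conditions, example):
    """True if the example satisfies every attribute=value test of a rule."""
    return all(example.get(attr) == value for attr, value in conditions.items())


def classify_unordered(rules, example, classes=CLASSES):
    """Sum the coverage of all satisfied rules and pick the best-covered class."""
    totals = [0] * len(classes)
    for conditions, _predicted, coverage in rules:
        if matches(conditions, example):
            totals = [t + c for t, c in zip(totals, coverage)]
    if not any(totals):
        return None, tuple(totals)  # no rule fired
    best = max(range(len(classes)), key=lambda i: totals[i])
    return classes[best], tuple(totals)


example = {"size": "large", "beaked": "yes", "legs": "two",
           "feather": "yes", "flies": "no"}
label, totals = classify_unordered(RULES, example)
print(label, totals)
```

For the example input all three rules fire, so the per-class coverage is summed across them before a single value is chosen.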



Probabilistic models From the viewpoint of probability theory, the task of word sense
disambiguation is to select the sense with maximal probability for a given input^11. The
probability for word sense s, P(s|x), is commonly transformed into Equation 2.6 through use of
Bayes' theorem.

$$P(s \mid x) = \frac{P(s) \cdot P(x \mid s)}{P(x)} \tag{2.6}$$



^11 Some statistical methods are reviewed, for example, by Charniak

In practice, P(x) can be omitted because this factor is constant for all the sense candidates, and
therefore does not affect the relative probability for different senses (Equation 2.7).



$$\arg\max_{s} P(s \mid x) = \arg\max_{s} P(s) \cdot P(x \mid s) \tag{2.7}$$

The probability of sense s, P(s), is usually estimated based on its distribution, obtained from
training data. Thus, the performance of the method fundamentally depends on the method for
approximating P(x|s). For the purposes of general discussion, let the input be represented by a
vector comprised of word sense disambiguation features, as given below.

$$\langle F_1 = f_1, F_2 = f_2, \ldots, F_N = f_N \rangle$$

Here, Fi and fi are the i-th feature type and its value, respectively. Let us summarize the diverse 
range of past methods based on the following two principles. 

The first principle is to identify an informative feature set, and ideally, different feature sets
for each different target polysemous word. Typically, words which saliently collocate with a
word sense are used as features. In this case, each feature takes a binary value, that is, 1 for
existence and 0 for absence in the input. A number of methods have been proposed to automatically
identify informative collocating words (termed "salient words" [154] or "indicators" [60]).
Yarowsky [154] used mutual information between word w and sense s (in Yarowsky's case, word
senses are semantic categories defined in Roget's thesaurus [14]) to estimate the degree of
salience of word w to sense s. Intuitively speaking, the mutual information between two
phenomena gives a greater value when these phenomena are more likely to co-occur. The mutual
information of w and s, I(w; s), is computed as shown in Equation 2.8.

$$I(w; s) = \log \frac{P(w \mid s)}{P(w)} \tag{2.8}$$

Here, P(w|s) is the probability that w appears given s, and P(w) is the probability that w
appears in the context^12. These factors are estimated based on the relative distributions of w
and s in the training data. Figure 2.5 shows examples of salient words related to the categories
ANIMAL and TOOLS [154]. Intuitively speaking, when words like species and family appear
in the input, the probability for ANIMAL tends to be greater than that for TOOLS. Justeson
and Katz [60] select word w as an indicator of sense s such that w appears more frequently with



^12 Strictly speaking, P(w|s) and P(w) should be denoted as P(F_w = 1|s) and P(F_w = 1), respectively,
where F_w is the feature representing the existence of word w. However, we use a simplified notation here.

sense s than with other sense candidates. This is equivalent to selecting a word w that satisfies
Equation 2.9, where t is any other sense candidate of the polysemous word under evaluation^13.



$$P(s \mid w) > P(t \mid w) \tag{2.9}$$

Their objective is to disambiguate adjective senses, or in their case, polysemy of an adjective as
defined by its antonyms. For example, the word old has two senses, i.e. "not new" and "not
young". Figure 2.6 shows examples of indicator nouns identified in their paper [60]. Note that
in Justeson and Katz's case, the disambiguation process in itself does not rely on a probabilistic
model, or more precisely, indicators are used simply as restriction rules. Ng and Lee [105] and
Pedersen et al. [112] use multiple feature types along with collocating words. The following
additional features are also usually used^14:

• the morphological properties of polysemous words (for example, singular/plural in the case
of a polysemous noun, and tense in the case of a polysemous verb),

• parts-of-speech of collocating words, 

• syntactic relations associated with polysemous words. 



category    salient words
ANIMAL      species, family, bird, fish
TOOLS       tool, machine, engine, blade



Figure 2.5: Example salient words for two categories 
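The salience scores behind word lists such as those in Figure 2.5 can be sketched as follows. This is a minimal illustration of the mutual information in Equation 2.8, estimated by relative frequency; the toy contexts are invented, and the function name `salience_scores` is hypothetical rather than part of any reviewed system.

```python
import math
from collections import Counter


def salience_scores(contexts, category):
    """Return I(w; s) = log(P(w|s) / P(w)) for every word seen with `category`.

    `contexts` is a list of (category, list_of_words) pairs.
    """
    word_total = Counter()    # occurrences of w over all contexts
    word_in_cat = Counter()   # occurrences of w in contexts of `category`
    n_total = 0
    n_cat = 0
    for cat, words in contexts:
        for w in words:
            word_total[w] += 1
            n_total += 1
            if cat == category:
                word_in_cat[w] += 1
                n_cat += 1
    scores = {}
    for w, c in word_in_cat.items():
        p_w_given_s = c / n_cat
        p_w = word_total[w] / n_total
        scores[w] = math.log(p_w_given_s / p_w)
    return scores


# Hypothetical toy contexts; real estimates would come from a large corpus.
CONTEXTS = [
    ("ANIMAL", ["species", "family", "bird", "the"]),
    ("ANIMAL", ["bird", "fish", "the"]),
    ("TOOLS", ["tool", "machine", "the"]),
    ("TOOLS", ["engine", "blade", "family", "the"]),
]
scores = salience_scores(CONTEXTS, "ANIMAL")
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:.3f}")
```

Words that occur equally often in every category (such as the in the toy data) receive a score of zero, whereas words concentrated in one category receive a positive score.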



sense                indicators
old ("not new")      world, thing, car, way
old ("not young")    man, people, woman, wine



Figure 2.6: Example indicator nouns for the adjective old 

The second principle is to compute (approximate) P(x|s), based on the conditional probability
of feature(s) given sense s. The simplest model, the "Naive-Bayes method", assumes
that features are conditionally independent of each other given sense s. That is, P(x|s) is



^13 In practice, P(s|w) and P(t|w) are transformed using Bayes' theorem, as for Equation 2.6.
^14 Ng and Lee [105] use these features for exemplar-based word sense disambiguation, which will be
described in the next "Similarity-based methods" item.

approximated simply by the product of each P(Fi = fi|s), as shown in Equation 2.10.

$$P(x \mid s) \approx \prod_{i} P(F_i = f_i \mid s) \tag{2.10}$$

A number of past studies have applied the Naive-Bayes method to word sense disambiguation [E^,
96, m 103, iiq, 111^.
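Equations 2.7 and 2.10 together can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the training examples for the word bank are invented, the binary features simply record whether a clue word is present, and the add-alpha smoothing is our own choice rather than that of any of the cited studies.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect counts from (feature_dict, sense) training pairs."""
    sense_counts = Counter()
    value_counts = defaultdict(Counter)   # (sense, feature) -> value counts
    values_seen = defaultdict(set)        # feature -> observed values
    for features, sense in examples:
        sense_counts[sense] += 1
        for feature, value in features.items():
            value_counts[(sense, feature)][value] += 1
            values_seen[feature].add(value)
    return sense_counts, value_counts, values_seen


def classify_naive_bayes(model, features, alpha=1.0):
    """Return argmax_s P(s) * prod_i P(F_i = f_i | s), computed in log space."""
    sense_counts, value_counts, values_seen = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n_s in sense_counts.items():
        score = math.log(n_s / total)                     # log P(s)
        for feature, value in features.items():
            n_values = len(values_seen[feature] | {value})
            numerator = value_counts[(sense, feature)][value] + alpha
            denominator = n_s + alpha * n_values
            score += math.log(numerator / denominator)    # log P(F_i = f_i | s)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy data: 1 if the clue word occurs in the context, else 0.
TRAIN = (
    [({"river": 1, "money": 0}, "river edge")] * 2
    + [({"river": 0, "money": 1}, "financial institution")] * 3
)
model = train_naive_bayes(TRAIN)
print(classify_naive_bayes(model, {"river": 1, "money": 0}))
```

Working in log space avoids numerical underflow when many features are multiplied, which does not change the argmax.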



A more complex model, namely the "decomposable model" [111, 112], considers interdependency
between different features. Here, let features (F1, F2, F3) represent the input. Suppose F2
and F3 are interdependent given s, while F1 is conditionally independent of them, such that
P(x|s) can be expressed by Equation 2.11.

$$P(x \mid s) = P(F_1 = f_1 \mid s) \cdot P(F_2 = f_2, F_3 = f_3 \mid s) \tag{2.11}$$

In the decomposable model, the number of parameters to be estimated, which is proportional
to the number of combinations of values for interdependent features, can be enormous. This
often leads to the data sparseness problem. In addition, it is difficult to identify which features
are interdependent given a particular sense. To overcome this problem, Pedersen et al. [112]
proposed a method of automatically identifying the optimal model (higher performance and
fewer parameter estimations), by iteratively altering the complexity level of the model. However,
their experimental results show that identified models did not generally outperform the Naive-
Bayes method.



Similarity-based methods Given problems along with their solutions (the way they have been
solved), humans are able to solve new problems by analogy with previously observed
cases. This analogy-based process has been explored in many AI applications, under the headings
of case-based reasoning |7^, exemplar-based reasoning, memory-based reasoning p9| and
instance-based learning [1]^15. A number of word sense disambiguation systems have variously
applied these methods^16. One critical issue in this has been the computation of similarity
between an input (a new problem) and examples in the training data (previous problems), and
thus, we term these methods "similarity-based methods". In the k-nearest neighbor method
(k-NN), one similarity-based method, processing proceeds as follows. First, k examples similar to
the input are retrieved from the training data. Thereafter, retrieved examples vote on the sense
of the polysemous word in the input, or in other words, the sense receiving the highest frequency


^15 Nagao, for example, explored the analogy principle in NLP applications.
^16 In the case of word sense disambiguation, the reasoning mechanism is relatively simple when compared with
tasks focused on in AI research (for example, resolution of a "political dispute" focused on by Kolodner [Q]),
because previous problems are merely example sentences containing polysemous words annotated with senses.

of annotation among the k examples is selected as the interpretation of the input word. In the
case of k = 1, this method is termed the nearest neighbor method, in which the input word is
disambiguated simply by adopting the word sense associated with the example of highest
similarity. Generally speaking, the nearest neighbor method has been used more commonly
than k-NN in past word sense disambiguation research^17. In the following, we classify past
similarity-based methods in terms of the method of computing the similarity (or distance).
Suppose each training example and the input ("examples", hereafter) are represented by a
feature vector defined by Equation 2.8, in which one may notice that each example is positioned
in an N-dimensional space, where feature Fi corresponds to the i-th axis. One common
implementation, termed the "vector space model" (VSM), computes the similarity between two
examples from the angle between the two vectors representing the examples. Note that VSM
has a long history of application in information retrieval (IR) and text categorization (TC)
tasks [127]. However, in the case of IR/TC, VSM is used to compute the similarity between
documents, each of which is represented by a vector comprising statistical factors of content
words in the document. Formally speaking, the similarity between two examples x and y is
computed as the cosine of the angle between their associated vectors. This can be expressed by
Equation 2.12, where x and y are vectors representing examples x and y, respectively.

$$\mathrm{sim}(x, y) = \frac{\mathbf{x} \cdot \mathbf{y}}{|\mathbf{x}|\,|\mathbf{y}|} \tag{2.12}$$
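The cosine measure of Equation 2.12 and the k-NN voting procedure described above can be combined in a short sketch. This is a minimal illustration under our own assumptions: examples are plain tuples of feature counts, the toy vectors are invented, and ties are broken by the order of the training data.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (Equation 2.12)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_sense(training, query, k=1):
    """Let the k training examples most similar to the query vote on the sense.

    `training` is a list of (vector, sense) pairs; k=1 gives the nearest
    neighbor method.
    """
    ranked = sorted(training, key=lambda ex: cosine(ex[0], query), reverse=True)
    votes = Counter(sense for _vector, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy vectors over four collocation features.
TRAINING = [
    ((1, 1, 0, 0), "organism"),
    ((1, 0, 1, 0), "organism"),
    ((0, 0, 1, 1), "factory"),
]
print(knn_sense(TRAINING, (0, 0, 0, 1), k=1))
print(knn_sense(TRAINING, (0, 0, 0, 1), k=3))
```

With k = 3 every toy example votes, so the majority sense wins regardless of similarity; this illustrates why the choice of k affects the outcome.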

Schütze [130] applies the vector space model to word sense disambiguation, although vectors are
used for each word sense, not individual examples^18. First, Schütze represents each word by a
"word vector", that is, a vector comprising the statistics (e.g. frequency) of its collocating words.
The collocational statistics are usually collected within a fixed proximity. Then, each context is
represented by a "context vector", which is the sum (or "centroid") of word vectors related to
words appearing in the context. Figure 2.7 illustrates the idea of the context and word vectors,
in which each w denotes a word vector corresponding to a context vector V. It should be noted
that unlike the vector space model in IR/TC, Schütze's method returns a positive similarity
value even when two given contexts have no words in common (such as with contexts
1 and 2 in Figure 2.7), that is, two context vectors can be similar when they comprise similar
word vectors. Then, automatic clustering algorithms [18, 27] are used to cluster each polysemous



^17 Ng [103] automatically identified the optimal value of k over the range of a given training data set, which
reportedly improved the performance of the nearest neighbor method.
^18 Within the IR/TC community, this approach is called "text-to-category comparison", and contrasted with
"text-to-text comparison" which computes the similarity between the input and individual examples.

word into word senses, which are also represented by "sense vectors"^19. In practice, Schütze
used the "singular value decomposition" (SVD) technique j^, which finds the major axes
and reduces the dimension of the vector space. Finally, Equation 2.12 is used to compute the
similarity between the input and each word sense cluster, to select the word sense with maximal
similarity.



Leacock et al. [78] compared the vector space model, the Naive-Bayes method (see the previous
paragraph) and neural networks, and reported that the vector space model and neural networks
slightly outperformed the Naive-Bayes method. Niwa and Nitta [106] also explored the vector
space model, in an implementation resembling Schütze's method. The vector representation is
expressed by Equation 2.13, where V(s) and V(w) are the context vector for sense s and the
word vector for word w within a fixed proximity from s, respectively.



$$V(s) = \sum_{w \in \mathrm{context}} V(w) \tag{2.13}$$



In Niwa and Nitta's case, the word vector for w, V(w), consists of the mutual information
between w and "origin words" (commonly used words), as in Equation 2.14.



$$V(w) = \langle I(w; o_1), I(w; o_2), \ldots, I(w; o_i), \ldots \rangle \tag{2.14}$$



Here, I(w; o_i) denotes the mutual information between w and origin word o_i. In practice, origin
words are made up of the 51st to 1050th most frequently used words in Collins English Dictionary
definitions [133]. Equation 2.12 is used to compute the similarity between two context vectors.
However, unlike Schütze's method, which averages context vectors into sense vectors, Niwa and
Nitta prefer to use the "text-to-text comparison" method: they compute the similarity between
the input and individual examples.

One may argue that the "Euclidean distance" between two examples in N-dimensional space
can also be applied to the similarity computation (as performed by Aha et al. [1] for instance-
based learning). However, this similarity measure seems less effective for word sense disambigua-
tion, because Euclidean distance is sensitive to vector length, which is usually proportional to
the frequency of collocating words. In other words, Euclidean distance usually assigns lower
similarities to frequently appearing word senses.



Ng and Lee [105], following Cost and Salzberg [25], used a different similarity measure.
In their method, the distance between two examples is computed by summing the distances
between the feature values associated with those examples. In other words, two examples are


^19 Schütze used two different clustering algorithms, depending on the data size.

Figure 2.7: The word and context vectors for two contexts

similar when they have feature values that roughly correspond in distribution as obtained from
the training data. The distance between two feature values f1 and f2 of a feature F is computed
as shown in Equation 2.15.



$$\mathrm{dist}(f_1, f_2) = \sum_{i} \left| P(s_i \mid F = f_1) - P(s_i \mid F = f_2) \right| \tag{2.15}$$

Here, P(s_i|F = f1) is the conditional probability of sense s_i given that feature F takes value f1.
P(s_i|F = f2) denotes the corresponding probability for value f2 of feature F. Ng and Lee [105]
used the multiple features described in the previous item (see the item "Probabilistic models"),
and showed their effectiveness over the use of a single feature type. Their results suggest that
feature selection is also a crucial task in similarity-based methods.
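The distance in Equation 2.15 can be sketched as follows. This is a minimal illustration assuming a single feature (the object noun of a verb) and invented training counts; the function names are hypothetical, and the handling of feature values unseen in training is deliberately left out.

```python
from collections import Counter, defaultdict


def sense_distributions(examples, feature):
    """Estimate P(s | F = f) for every value f of `feature` seen in training."""
    counts = defaultdict(Counter)
    for features, sense in examples:
        counts[features[feature]][sense] += 1
    return {
        value: {s: n / sum(c.values()) for s, n in c.items()}
        for value, c in counts.items()
    }


def value_distance(dists, f1, f2):
    """Equation 2.15: sum over senses of |P(s|F=f1) - P(s|F=f2)|."""
    p1, p2 = dists[f1], dists[f2]   # raises KeyError for unseen values
    senses = set(p1) | set(p2)
    return sum(abs(p1.get(s, 0.0) - p2.get(s, 0.0)) for s in senses)


def example_distance(dists_by_feature, x, y):
    """Distance between two examples: sum of per-feature value distances."""
    return sum(
        value_distance(dists, x[feature], y[feature])
        for feature, dists in dists_by_feature.items()
    )


# Hypothetical toy data: object nouns of a polysemous verb with two senses.
TRAIN = [
    ({"object": "graduate"}, "hire"),
    ({"object": "graduate"}, "hire"),
    ({"object": "expert"}, "hire"),
    ({"object": "expert"}, "use"),
    ({"object": "method"}, "use"),
    ({"object": "method"}, "use"),
]
D = sense_distributions(TRAIN, "object")
print(value_distance(D, "graduate", "expert"))
print(value_distance(D, "graduate", "method"))
```

Two values are close when they co-occur with the senses in similar proportions, so the measure needs no external notion of word meaning.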



Cho and Kim [19] used "relative entropy", which estimates the degree to which two proba-
bility distributions differ, for disambiguation of verb sense. They used only one feature, in that
each example is represented simply by the object case noun for the target polysemous verb. The
similarity between examples is computed based on the distribution of their associated nouns.
Cho and Kim represented the distribution of noun n by Equation 2.16, where P(v_i|n) is the
conditional probability that verb v_i appears given noun n.

$$d(n) = \langle P(v_1 \mid n), P(v_2 \mid n), \ldots, P(v_m \mid n) \rangle \tag{2.16}$$



Given two examples associated with distributions d(n1) and d(n2), the similarity between them
is computed by Equation 2.17^20.



$$\mathrm{sim}(d(n_1), d(n_2)) = -\sum_{i} P(v_i \mid n_1) \log \frac{P(v_i \mid n_1)}{P(v_i \mid n_2)} \tag{2.17}$$
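Equation 2.17 can be sketched as follows. This is a minimal illustration only: the distributions are invented, and the floor `epsilon` applied to zero probabilities is our own stand-in, since the smoothing technique actually used by Cho and Kim is not described here.

```python
import math


def relative_entropy_similarity(d1, d2, epsilon=1e-6):
    """Negated relative entropy between two distributions (Equation 2.17).

    The result is 0 for identical distributions and becomes more negative
    as the distributions diverge.
    """
    total = 0.0
    for p, q in zip(d1, d2):
        if p > 0:
            # max(q, epsilon) avoids log of a ratio with a zero denominator;
            # this floor is an assumption, not the original smoothing method.
            total += p * math.log(p / max(q, epsilon))
    return -total


# Hypothetical distributions P(v_i | n) over two verbs for three nouns.
d_a = [0.9, 0.1]
d_b = [0.8, 0.2]
d_c = [0.5, 0.5]
print(relative_entropy_similarity(d_a, d_b))
print(relative_entropy_similarity(d_a, d_c))
```

Note that the measure is asymmetric: swapping the two arguments generally gives a different value.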

All the methods reviewed so far compute the similarity based on the statistical distribution 
obtained from training data. The remainder of this section is devoted to reviewing methods 
relying on hand-crafted resources (mostly thesauri) for the similarity computation. Here, let us 
assume that sentences (1) (see page 10) are examples incorporated into training data, and that 
we have the following sentence as input. 

(2) The company employed the graduate. 

One may notice that the sense of employ in the input would be "to hire", because the subject and
object of employ in the input are semantically closer to those in (1-a), respectively, than to those
in (1-b). This is where hand-crafted thesauri can be applied, based on the intuitively plausible
assumption that words located near each other within the structure of a thesaurus have similar
meanings. In practice, most methods heuristically predefine the relation between the similarity
of two nouns and the length of the path between them in the thesaurus structure. Kurohashi
and Nagao [77] used the Bunruigoihyo thesaurus [102] for disambiguating senses of Japanese
verbs. Uramoto [142] used LDOCE [114] for disambiguating senses of English verbs. Note that
complements of verbs in the training data can be considered as an extensional description of
selectional restriction. In other words, instead of merely ruling out inappropriate case filler
nouns, example complements estimate the degree to which input complements satisfy the
restriction imposed on them. It should also be noted that lexical ambiguity of complement nouns
can be resolved in the same manner as previously demonstrated for the disambiguation of facility
in (1-a). One may notice that company in the input can be interpreted as "installation" because
facility in (1-a) is also related to "installation", and therefore these two words are located in
close vicinity within the thesaurus^21. Li et al. [81] and Lin [82] (independently) explored this
notion. In both cases, the similarity between two words is computed based on the taxonomy
defined in WordNet [^]. Li et al. [81] heuristically predefined the relation between the length of
the path in the taxonomy and the similarity. Lin used a more formal measure resembling the
information-theoretic taxonomy-based similarity measure proposed by Resnik [121, 122] (see
Section for details).



^20 In practice, Cho and Kim [19] used a smoothing technique in the case of an occurrence of P(v_i|n) = 0 in the
given vectors. However, we do not further describe the technique utilized.

^21 Regrettably, Kurohashi and Nagao [77] and Uramoto [142] did not make significant comment on this point.

2.2.2 Restricting supervision

Corpus-based approaches have recently pointed the way to a promising trend in word sense
disambiguation. A number of experiments have shown that the performance of word sense
disambiguation can be significantly improved by increasing the volume of supervised
examples [96, 104]. However, supervised methods then require considerable manual annotation
to prepare large-sized training data sets. To resolve this problem, unsupervised and semi-
supervised methods, which (semi-)automatically acquire annotated training data, have been
variously explored. On the other hand, one may wonder if systems without human supervision
can perform to a reasonable level. Below, we survey different approaches for restricting reliance
on supervision. In addition, despite the promising features of each method, we discuss problems
and open questions related to the various approach types.



Bilingual corpora Based on the observation that different word senses in a given language 
can correspond to distinct words in other languages, Dagan and Itai [^] used bilingual corpora 
for word sense disambiguation (they used Hebrew-English and German-English language-pair 
corpora, respectively). In their case, word polysemy in the source language is defined based on 
the existence of separate translations in the target language. In other words, the objective of 
this research can be seen as translating a source language sentence to a target language sentence 
on a word-to-word basis. The method proceeds as follows: 

1. the source language corpus is parsed to extract syntactic tuples (source syntactic tuples), 

2. a bilingual lexicon is used to generate alternative target syntactic tuples for each source 
syntactic tuple, 

3. the target language corpus is parsed to extract target syntactic tuples, which are used to 
evaluate the plausibility of each alternative target tuple generated in the previous step, 

4. the target sentence maximizing the combined plausibility of target syntactic tuples is 
generated. 

For example, the Hebrew tuple "higdil sikkuy" (verb-obj) containing the polysemous word higdil
is associated with three English tuples: "increase chance", "enlarge chance" and "magnify
chance". However, the polysemy can be resolved by selecting the tuple which is most likely
to occur, based on the target corpus, i.e. "increase chance". It should be noted that Dagan and
Itai's method uses information extracted independently from the source and target language
corpora, which means that the two corpora need not be translations of one another,
and that manual supervision is avoided. This property of potential non-correspondence leads to
a salient contrast with the method proposed by Brown et al. which uses bilingual aligned
corpora. The applicability of bilingual corpus-based methods is relatively limited, that is, they
are restricted to translation-oriented applications. Word sense distinctions evidently differ
depending on the application, and therefore (as identified by Krovetz [75], for example) word
sense disambiguation aimed at machine translation is not necessarily useful for information
retrieval.
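Steps 2 and 3 of the procedure above can be sketched for a single syntactic tuple as follows. This is a minimal illustration under our own assumptions: the lexicon entries follow the higdil sikkuy example, the target-corpus counts are invented, and a plain frequency comparison stands in for whatever plausibility measure the original method uses.

```python
from collections import Counter


def choose_translation(source_tuple, lexicon, target_counts):
    """Pick the most frequent target tuple among the alternative translations.

    `source_tuple` is (relation, head, dependent); `lexicon` maps each source
    word to its candidate translations; `target_counts` holds frequencies of
    syntactic tuples extracted from a (possibly unrelated) target corpus.
    """
    relation, head, dependent = source_tuple
    candidates = [
        (relation, h, d) for h in lexicon[head] for d in lexicon[dependent]
    ]
    # A Counter returns 0 for tuples never observed in the target corpus.
    return max(candidates, key=lambda t: target_counts[t])


LEXICON = {
    "higdil": ["increase", "enlarge", "magnify"],
    "sikkuy": ["chance"],
}
# Hypothetical counts of verb-obj tuples in an English corpus.
TARGET_COUNTS = Counter({
    ("verb-obj", "increase", "chance"): 20,
    ("verb-obj", "enlarge", "chance"): 1,
})
best = choose_translation(("verb-obj", "higdil", "sikkuy"), LEXICON, TARGET_COUNTS)
print(best)
```

Since the counts come from the target corpus alone, no aligned bilingual text and no sense annotation are needed for this selection step.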

Bootstrapping The basis of "bootstrapping" is, given an initial training data set (usually
consisting of a small number of annotated examples), to progressively enhance the training data
by iteratively acquiring presumably correctly annotated examples from previous disambigua-
tion results. As can be imagined, this method often ends up incorporating noise (incorrectly
annotated examples) into the training data. To avoid this problem, Hearst used a rela-
tively small number of supervised examples as initial training data. Yarowsky [157] completely
excluded manual supervision by automatically acquiring the initial training data set from a dic-
tionary. In addition, Yarowsky used discourse constraints to exclude noise from the decision list.
To put it more precisely, when a significantly large number of examples associated with a given
discourse are annotated with a common sense in the training data, all examples associated with
that discourse are standardized to the same sense annotation. Yarowsky's experimental results
show that the performance of this method is equivalent to that achieved by supervised learning.
However, one controversial issue would be the applicability of bootstrapping to the
disambiguation of finer-grained word sense distinctions [105], in that this method has generally
been applied only to coarse-grained binary word distinctions, such as sake ("benefit"/"Japanese
liquor"). Karov and Edelman [63] used bootstrapping to automatically enhance word sense
classifiers using dictionary definitions and a corpus. As with similar research, their method also
focused on coarse-grained binary sense distinctions, such as suit ("court"/"garment"). To sum
up, the effectiveness of fully-automatic bootstrapping for the disambiguation of fine-grained sense
distinctions remains an open question.



Spreading noise through monosemous words Yarowsky [154] projected polysemy onto
word categories defined in Roget's thesaurus [14], and trained statistical classifiers for categories
rather than individual words. Consequently, monosemous words associated with each category
provide reliable co-occurrence statistics, and noisy statistics arising from polysemous words
should be relatively dispersed (tolerable), without any supervision in sense tagging. For example,
the classifier for the category ANIMAL is expected to be reliably trained by monosemous words
like sparrow, although polysemous words like crane potentially introduce a certain proportion
of noise. Luk [83] adopted a similar approach to obtain statistics about word senses defined
in LDOCE [114]. While these methods completely remove any overhead for supervision, the
applicability of the methods is limited to word senses mapped onto thesauri, where word entries
are associated with categories. This property does not apply to many dictionaries.



Automatic clustering Schütze [130] reduced manual supervision by using automatic cluster-
ing algorithms (proposed by Cheeseman et al. [18] and Cutting et al. [27], respectively). First,
clustering algorithms are used to divide the training data into a certain number of clusters.
Thereafter, a human expert examines a small number of examples (from 10 to 20) contained
in each cluster, which are applied in determining an appropriate word sense for each cluster.
Thus, strictly speaking, this method does not constitute supervised learning. Given an input,
the cluster (word sense) with the maximum similarity to the input is selected as the interpreta-
tion (see Equation 2.12 for this computation). Regrettably, since reported experiments have not
compared this semi-supervised method with fully-supervised learning methods, it is not possible
to ascertain whether automatic clustering algorithms are expected to reduce manual supervision
without degrading the performance demonstrated by supervised methods.

Pedersen and Bruce [110] used automatic clustering algorithms relying on McQuitty's simi-
larity analysis and Ward's minimum-variance method [148], respectively. Their training/test
data includes polysemous nouns, verbs and adjectives collected from the ACL/DCI Wall Street
Journal corpus [g^, in which each word is annotated with a single sense defined in LDOCE [114]
or WordNet. Their comparative experiments showed that automatic methods still find it
difficult to match the accuracy achievable with fully-supervised learning methods in the disam-
biguation of relatively fine-grained polysemy (respective accuracies of roughly 66% vs. 84%).
They also tested the expectation maximization (EM) algorithm [31] for unsupervised learning,
which resulted in an accuracy of about 63%.

Linguistic behavior Justeson and Katz [60] proposed a method to automatically acquire
training examples for the disambiguation of adjective senses. As described in Section 2.2.1, in
this case, polysemy is projected onto adjective antonyms. Their acquisition method uses the
following three principles:

(a) antonyms often co-occur during direct comparison, in contrastive opposition, 

(b) antonyms are frequently joined by and or or, 

(c) antonyms frequently appear in noun phrases joined by prepositions and with the
same head noun.

Each principle can be understood through the following example sentences presented in their
paper [60], which contain the adjectives old ("not new"/"not young") and hard ("not easy"/"not
soft").

(a) They indicated that no new errors were being made and that all old errors would 
be corrected within 60 days. 

(b) It was pitiful to see the thin ranks of warriors, old and young . . . 

(c) But there is no sudden transition from hard to soft. 

The corresponding principles annotate the instances of old and hard in sentences (a), (b) and (c)
with the senses "not new", "not young" and "not soft", respectively. These training examples are
used to identify indicator nouns (see Section 2.2.1). While this method successfully performs
on these examples, the following issues must be considered. First, the above principles can
be applied only to adjective sense disambiguation in which senses are mapped onto concepts
associated with adjective antonyms. Second, the coverage of this method is relatively small:
Justeson and Katz used a corpus consisting of 1.5 million sentences, and could obtain only
about 1500 co-occurrences of a target adjective and antonym. Besides this, indicator nouns
identified from these co-occurrences could disambiguate only about 27% of (open) test inputs.

Machine readable dictionaries Dictionaries provide definitions (and in some cases example 
sentences) for each word sense, which contain a number of "clue words" . Let us take the following 
sentences as example definitions for different senses of the word bank ("river edge" / "financial 
institution"): 

(3) a. land along side of river/lake ("river edge")
b. place money kept ("financial institution")

As can be seen, these definitions contain clue words, such as river or money, which are intuitively
associated with the respective senses of bank given above. Supposing a given input contains
bank and river, one can easily select the former sense for the interpretation of the input bank.
A number of methods follow this intuition [p6, 47, 63, 79, 108, 151]. Given an input, these
methods generally count the number of (clue) words appearing in both the input and the
definition, as the score for each sense candidate. Thereafter, they select the sense with the
maximum score. A variation of these methods is to normalize the score by the length of the input.
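The scoring scheme described above can be sketched as follows. This is a minimal illustration using the two definitions of bank in (3); the input sentence and the function names are our own, and no stop-word filtering or morphological normalization is applied, although a practical system would presumably need both.

```python
DEFINITIONS = {
    "river edge": "land along side of river lake".split(),
    "financial institution": "place money kept".split(),
}


def overlap_score(input_words, definition_words):
    """Number of distinct words shared by the input and a sense definition."""
    return len(set(input_words) & set(definition_words))


def disambiguate(input_words, definitions=DEFINITIONS, normalize=False):
    """Score every sense candidate and return the best one with all scores."""
    scores = {}
    for sense, definition_words in definitions.items():
        score = overlap_score(input_words, definition_words)
        if normalize and input_words:
            score = score / len(input_words)   # variation: length-normalized
        scores[sense] = score
    best = max(scores, key=scores.get)
    return best, scores


sentence = "the boat reached the bank near the river".split()
best, scores = disambiguate(sentence)
print(best, scores)
```

When no clue word is shared with any definition, all scores are zero and the choice is arbitrary, which reflects the dependence of these methods on the wording and length of the definitions.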

Methods of this type can be categorized as instances of similarity-based methods (see Sec-
tion 2.2.1), because the score represents the similarity between the input and each sense candidate^22.



Other methods, such as the Naive-Bayes method, can also use clue words as a feature set. 






Unlike supervised methods, these methods do not involve an excessive overhead for large-scale
sense annotation^^.

However, the quality and quantity of definitions, which highly influence performance, are 
problematic for these approaches. Definitions are relatively arbitrary, and thus different dic- 
tionaries often provide different definitions for the same sense. This problem is less crucial for 
human users because they can rely on external knowledge. However, lacking sufficient knowl- 
edge, computers can potentially fail to perform correctly given undesirable definitions. As for 
the quantity problem, it is hard to obtain adequate statistical information because definitions 
usually contain small numbers of words, or in some cases simply comprise synonym words. The 
method proposed by Luk [83], in which an outside corpus (the Brown corpus) was combined
to enhance the statistics, would be one solution to this problem. Co-occurrence statistics were
obtained in terms of a "control vocabulary" obtained from LDOCE [114]. However, this method
requires additional manual compilation of the control vocabulary set (pruning rarely occurring
words, for example). Okumura and Matsunaga [108] also overcame the quantity problem by
expanding definitions through the EDR thesaurus [58]. However, these general methods do not
directly address the quality problem. To sum up, we claim that the use of dictionary definitions 
is not a stand-alone approach: definitions can be used as an initial resource, but must be com- 
bined with additional (supervised) training data. This claim was experimentally validated, as 
described in Chapter |^. 



2.3 Evaluation Methodology 

From a scientific point of view, performance evaluation is invaluable. The procedure to evaluate 
a dedicated method is fundamentally to simulate run-time usage of the method by providing a 
corpus as training/test data. However, the performance of corpus-based methods is generally 
strongly biased by the training data provided, as well as test data. To minimize this bias, most 
experiments iteratively carry out the same trials, and average the results derived from each trial. 
In each trial, a fixed number of examples are randomly selected from the corpus, as the train- 
ing and test data, respectively. This evaluation methodology can be called "cross validation". 
Strictly speaking, the performance of methods under evaluation is inherently biased by the given 
corpus. However, since collecting broader coverage sense-annotated corpora requires tremendous 
human labor, most researchers have conducted experiments using relatively smaller-sized cor- 

^^Strictly speaking, human lexicographers need to provide sense definitions, a process which is associated with
considerable overhead. However, given that a number of machine readable dictionaries are currently available, 
word sense disambiguation system developers are virtually released from the task of establishing sense definitions. 
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pora at hand at the time. Another problem is that since interpretation of "performance" seems 
to vary according to the individual, a number of separate evaluation criteria exist. The following 
sections detail evaluation methodologies used in past word sense disambiguation methods. 



2.3.1 Common evaluation criteria 

One common evaluation criterion is the degree to which a method can be applied to run-time 
inputs, that is, the "coverage". This criterion is crucial especially for qualitative approaches
(see Section 2.2.1), because naive rules often reject all the sense candidates and therefore no
decision can be made. Additionally, the ratio between the number of cases where the correct
decision was made and the total number of decisions made, termed "accuracy", is commonly
used. These two criteria can be summarized as in Equation 2.18.



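\[
\begin{aligned}
\text{coverage} &= \frac{\text{\# of decisions made}}{\text{total \# of inputs}} \\
\text{accuracy} &= \frac{\text{\# of correct decisions made}}{\text{total \# of decisions made}}
\end{aligned}
\tag{2.18}
\]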
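To make the two criteria concrete, the following is a minimal sketch of how coverage and accuracy could be computed for a system that is allowed to abstain. Representing an abstention by `None` and the function name are our own assumptions, not part of any cited method.

```python
def coverage_and_accuracy(decisions, gold):
    """Compute coverage and accuracy for a list of system decisions.

    decisions: the sense chosen for each input, or None when no decision is made.
    gold:      the correct sense for each input.
    """
    made = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    # coverage = # of decisions made / total # of inputs
    coverage = len(made) / len(gold) if gold else 0.0
    # accuracy = # of correct decisions made / total # of decisions made
    accuracy = sum(d == g for d, g in made) / len(made) if made else 0.0
    return coverage, accuracy


# Four inputs: the system abstains once and is wrong once.
coverage, accuracy = coverage_and_accuracy(
    ["use", None, "spend", "use"],
    ["use", "use", "use", "use"],
)
```

In this example three of the four inputs receive a decision (coverage 0.75) and two of those three decisions are correct (accuracy 2/3).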

It should be noted that these criteria can also be applied to quantitative approaches. Given 
the degree of confidence about the decision, the method can purposefully refuse to make a risky 
decision. A number of instances of this principle can be found in the evaluation of quantitative 
word sense disambiguation methods. Dagan and Itai estimated the degree of confidence by
the statistical significance of the given training data | 153 p^. Li et al. [81] empirically defined the
relation between their heuristic rules and the confidence degree. In these cases, coverage and
accuracy are contradictory criteria. However, most recent experiments have commonly employed
a "backing-off" strategy (for example, random choice) when no decision can be made, resulting
in a coverage of virtually 100% and performance being evaluated simply through accuracy. One
plausible reason for this approach is that a single criterion is more intuitively understandable
than multiple criteria. In fact, in the case of Li et al. [81] mentioned above, the trade-off between
coverage and accuracy can be interpreted as the validation of the definition of confidence degree, 
rather than the performance of the method itself. 

One may notice that coverage and accuracy resemble "recall" and "precision", respectively, 
which have been commonly used in performance evaluation of information retrieval (IR) and 



^^The term "coverage" can have different interpretations. One example is the degree to which a system restricts
the type of input (speech/on-line text). However, we confine interpretation of coverage to the narrower meaning 
given in the text. 

^^Dagan and Itai ji^] used the criteria of "applicability" and "precision", which correspond to "coverage" and 
"accuracy", respectively, in this section. 
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text categorization (TC) tasks. In fact, the definition of recall/precision in experiments carried
out by Ribas [124], for example, is almost the same as that of coverage/accuracy, as given above.
Let us devote a little more space to explaining the notion of recall and precision. In the case 
of IR, recall favors systems that retrieve as many documents salient to a user query as possible 
(disregarding noise contained in the retrieved data), while precision favors systems that retrieve 
as few irrelevant documents as possible. As can be seen, when all the documents are retrieved, 
recall is always 100%, potentially sacrificing precision. On the other hand, in the case of TC, 
recall favors systems that assign as many correct categories to each document as possible, while 
precision favors systems that assign as few incorrect categories to each document as possible. 
Note that since a single document can usually be associated with multiple categories according 
to different points of view, evaluation relying on a single criterion, i.e. accuracy, is less effective. 
The notion of these two contradictory criteria can be generalized as follows. Let us assume a 
situation in which a subject has to answer "yes" or "no" to N given questions (correct answers 
are not given to the subject, of course). The results can be classified into four cases as shown 
in Table 2.8, which is called a "contingency table". In this table, capital letters (from A to D)
denote the number of instances associated with each case. Correspondingly, recall and precision
are defined as in Equation 2.19.

\[
\text{recall} = \frac{A}{A + C}, \qquad \text{precision} = \frac{A}{A + B} \tag{2.19}
\]

Note that in the case of IR, a question corresponds to a document, and "yes" means that the 
document is retrieved. Therefore, A+B and A+C denote the number of retrieved documents and 
salient documents, respectively. On the other hand, in the case of TC, a question corresponds 
to each combination of document and category. Thus, A+B and A+C denote the number of 
categories assigned to documents and correct categories assigned to documents, respectively. We 
may note, in passing, that to integrate recall and precision, the "F-measure", which is expressed by
Equation 2.20, is often used as an evaluation criterion.

\[
\text{F-measure} = \frac{(\beta^{2} + 1) \cdot \text{recall} \cdot \text{precision}}{\beta^{2} \cdot \text{precision} + \text{recall}} \tag{2.20}
\]

Here, $\beta$ is a parametric constant used to control the preference between recall and precision.
One may notice that as a type of categorization task, word sense disambiguation can equally 
be evaluated as performed for text categorization. However, most researchers coarsely assign a 
single sense to each word, and seem to prefer accuracy as the evaluation criterion. 
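The relation between the contingency counts and these criteria can be sketched as follows. The cell layout in the comment is inferred from the text (A+B is the number of "yes" answers given, A+C is the number of questions whose correct answer is "yes"); the function names and the handling of empty denominators are our own assumptions.

```python
# Contingency counts, as inferred from the text:
#   A: answered "yes", correct answer "yes"    B: answered "yes", correct answer "no"
#   C: answered "no",  correct answer "yes"    D: answered "no",  correct answer "no"

def recall_precision(a, b, c):
    """Recall = A / (A + C); precision = A / (A + B). D is not needed."""
    recall = a / (a + c) if a + c else 0.0
    precision = a / (a + b) if a + b else 0.0
    return recall, precision


def f_measure(recall, precision, beta=1.0):
    """Combine recall and precision; beta controls the preference between them."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * recall * precision / denominator


# 10 "yes" answers of which 8 are right (A=8, B=2); 8 further "yes" cases were missed (C=8).
recall, precision = recall_precision(a=8, b=2, c=8)
f1 = f_measure(recall, precision, beta=1.0)
```

With these counts recall is 0.5 and precision is 0.8; setting beta to 1 weights the two equally.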
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Figure 2.8: Contingency table 



Finally, let us describe several entropy-driven criteria, which have not been used in
past evaluations of word sense disambiguation. As Resnik and Yarowsky [123] identified, "cross-entropy",
which has commonly been used for the evaluation of speech recognition systems, can
also be used to evaluate probabilistic word sense disambiguation systems. Intuitively speaking, 
this criterion allocates a higher score when the system assigns a higher probability to a correct 
answer. Kononenko and Bratko Jt^ ] consider the prior distribution of candidate classes as well 
as the posterior distribution, that is, a higher score is allocated when the system improves on 
the prior probability for the correct answer. This criterion is expected to make it easier to 
compare results derived from different test data collections. The same motivation can be found
in the evaluation methodology proposed by Gale et al. [42], in which the system performance
was compared with the lower bound performance gained through the prior distribution of word 
senses. 
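As a rough illustration of the cross-entropy criterion, one common formulation averages the negative log probability that the system assigns to the correct sense. This sketch is our own reading rather than the exact formulation of the cited authors; note that in this form the value is a cost, so assigning a higher probability to the correct answer yields a lower (better) value. The small floor used to avoid log(0) is also an assumption.

```python
import math


def cross_entropy(predictions, gold, floor=1e-12):
    """Mean negative log2 probability assigned to the correct sense.

    predictions: one dict per input, mapping each sense candidate to a probability.
    gold:        the correct sense for each input.
    """
    total = 0.0
    for distribution, answer in zip(predictions, gold):
        probability = max(distribution.get(answer, 0.0), floor)
        total += -math.log2(probability)
    return total / len(gold)


score = cross_entropy(
    [{"spend": 0.5, "employ": 0.5}, {"spend": 0.25, "employ": 0.75}],
    ["spend", "spend"],
)
```

In this example the correct sense receives probabilities 0.5 and 0.25, which cost 1 and 2 bits respectively, giving an average of 1.5 bits.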



2.3.2 Multiple human judgement 



In using any of the evaluation criteria described in Section 2.3.1, the system decision must
be compared with the correct answer. In most cases, human judgement provides the "correct" 
answer. However, there are cases where even human experts cannot correctly perform word sense 
disambiguation, because of differing interpretation of word senses between individuals [42, pT 



Consequently, system evaluation is biased by human judgement. One can expect to minimize 
this bias through the use of multiple human experts, that is, by validating the judgement of one 
human through evaluation by other humans. Although the associated cost is greater than for 
a single human judge, there have been a number of past cases of multiple judges being used in 
word sense disambiguation research^^. 



Ravin [118] asked multiple humans (including both linguists and non-linguists) to disambiguate
the test data which was intended for subsequent system evaluation, and used only those
inputs where humans agreed on the answer. Gale et al. [43] also used multiple human judges



^^Note that multiple human judges have also been used in the evaluation of other applications, such as speech 
recognition systems. 
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to estimate the upper bound on their system's performance. By "upper bound", they meant 
human performance on disambiguation, which can be evaluated by way of comparison with 
other human judgements. To estimate the upper bound more reliably, they conducted the same
trials by iteratively changing the human under evaluation, to average the performance of different
humans. Hatzivassiloglou and McKeown used multiple human judges to directly score
individual decisions made by their system (although they used this methodology to evaluate the
performance of adjective clustering). Intuitively speaking, their system was attributed a higher
score when it made a correct decision on which more human judges agreed.



2.3.3 Empirical comparison 

Empirical (quantitative) comparison of a proposed method with various existing methods is
a common evaluation methodology, assuming some well-defined evaluation criterion. In most
cases, proposed (new) methods are compared with a lower bound technique, which in the case of
word sense disambiguation is to systematically choose the most frequently appearing sense in the
training data [42]. One aspect lacking in the empirical comparison of word sense disambiguation
systems is that there is no contest for systems, as exists for information extraction systems in
the Message Understanding Conferences (MUCs). However, a number of extensive comparisons
involving a relatively large number of different methods can be found in past experiments. Let us
devote a little space to describing these experiments, which fundamentally used accuracy as an
evaluation criterion (some of the disambiguation methods below are described in Section 2.2.1).
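The lower bound technique mentioned above, that is, always choosing the most frequent sense in the training data, can be sketched in a few lines. The function names and the toy sense labels are our own illustrative assumptions.

```python
from collections import Counter


def most_frequent_sense(training_senses):
    """Return the sense that appears most often in the training data."""
    return Counter(training_senses).most_common(1)[0][0]


def baseline_accuracy(training_senses, test_senses):
    """Accuracy obtained by assigning the most frequent training sense to every test input."""
    guess = most_frequent_sense(training_senses)
    return sum(sense == guess for sense in test_senses) / len(test_senses)


lower_bound = baseline_accuracy(
    ["use", "use", "spend", "employ"],   # senses observed in training data
    ["use", "spend", "use", "use"],      # correct senses of the test inputs
)
```

Here "use" is the most frequent training sense, and it is correct for three of the four test inputs, so the lower bound is 0.75. A proposed method is expected to exceed this figure.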

In the case of word sense disambiguation for English, the ACL/DCI Wall Street Journal
corpus [^], in which each word is annotated with a single sense defined in LDOCE [114] or
WordNet [93], has commonly been used. Leacock et al. [^] compared the vector space model,
the Naive-Bayes method and neural networks, and reported that the Naive-Bayes method is
marginally outperformed by the two other methods. Mooney [96] reported that the Naive-Bayes
method outperformed, variously, neural networks, C4.5, k-NN (k = 3), and variations
of the "FOIL" rule induction algorithm [116]. An experiment conducted by Ng [103] showed
that "PEBLS" [^] and the Naive-Bayes method are comparable in performance. Pedersen
and Bruce [110] compared three unsupervised methods: two automatic clustering algorithms
(namely, McQuitty's similarity analysis [^] and Ward's minimum-variance method [14g]) and
the expectation maximization (EM) algorithm [31]. In the experimental results, McQuitty's
similarity clustering outperformed the other two methods. Pedersen and Bruce [111] reported
that C4.5 slightly outperformed all of the Naive-Bayes method, PEBLS, CN2 and variations of
their decomposable model [13, p.l2].
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The relative results derived for these above experiments differ because each method involves 
separate parameters, which can vary between experiments. For example, a feature set, which 
would be inconsistent between most of the methods above, is highly influential in performance. 

Fujii et al. compared a number of different methods using the EDR Japanese corpus.
According to their results, the performance of the compared methods can be sorted in descending
order as follows: the thesaurus-driven similarity-based method^^, the vector space model,
the Naive-Bayes method and the qualitative method using selectional restriction. To derive
more general results, Fujii et al. used a different type of criterion along with accuracy.
Instead of the conventional binary (correct/incorrect) judgement (i.e. accuracy), they applied
an evaluation criterion based on the semantic similarity between word sense candidates, as advocated
by a number of researchers [p2, 123]. To exemplify this notion, let us take the polysemous
Japanese verb tsukau, which has multiple senses in EDR, such as "to employ", "to operate",
"to spend" and "to use MATERIAL". These senses are associated with the EDR thesaurus.
Figure 2.9 shows a fragment of the thesaurus, in which an oval denotes a word sense. As with
most thesauri, the length of the path between two word senses can be seen as the relative semantic
distance between them. As one can see, the verb sense "to spend" is physically closer to
"to use MATERIAL" than to "to employ" or "to operate", in structure. In fact, these two proximal
verb senses can be merged into one common category, that is, "to use up". Furthermore,
they can be merged with "to operate" to form "to use PHYSICAL OBJECT", as distinct from the
remaining sense of "to employ (HUMAN/CONCEPT)". Let the correct sense of an input tsukau
be "to use MATERIAL", and assume the system incorrectly outputs "to spend". In this case,
the error would be more acceptable than outputting "to employ", which is totally different from
the correct interpretation of "to use MATERIAL". Therefore, we should allot the system a scaled
acceptability factor instead of a score of zero. Note that the binary judgment simply scores
the system zero, irrespective of the extent to which the error is acceptable given a particular
practical application. Fujii et al. formalized the scaled acceptability factor ("acceptability",
hereafter) as in Equation 2.21.



\[
A(x, s) = \left( \frac{MAXLEN - EDR(x, s)}{MAXLEN} \right)^{\alpha} \tag{2.21}
\]

Here, x and s are the system's interpretation and the correct answer, respectively, and A(x, s)
is the acceptability of the given x and s. EDR(x, s) represents the length of the path between
x and s in the EDR thesaurus. MAXLEN is the maximum length of the path between senses



^^Fujii et al. [M] used the Bunruigoihyo thesaurus [102] for the similarity computation.
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associated with an input verb. For example, MAXLEN = 7 in Figure 2.9 (the length of the
path between "to operate" and "to employ"). α is a parametric constant, which acts to control
the reduction factor of A(x, s) for incorrect interpretations. With a larger α, A(x, s) becomes
smaller for incorrect interpretations, and becomes closer to the binary judgment. One can notice
that the acceptability ranges from 0 (where x and s are the most dissimilar verb senses) to 1 (x
and s are identical). The result derived through the acceptability with different α values showed
the same tendency as the accuracy.
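The behaviour of the acceptability can be checked with a small sketch. The function below takes the path length between the system's interpretation and the correct sense as a plain number; the function name and the path length of 2 used in the example are hypothetical, and only MAXLEN = 7 is taken from the text.

```python
def acceptability(path_length, max_len, alpha=1.0):
    """Scaled acceptability: ((MAXLEN - EDR(x, s)) / MAXLEN) ** alpha.

    path_length: length of the thesaurus path between the system output x and
                 the correct sense s (0 when they are identical).
    max_len:     maximum path length between senses of the input verb.
    alpha:       larger values penalize incorrect interpretations more heavily.
    """
    if not 0 <= path_length <= max_len:
        raise ValueError("path_length must lie between 0 and max_len")
    return ((max_len - path_length) / max_len) ** alpha


MAXLEN = 7  # value given in the text for the senses of tsukau

identical = acceptability(0, MAXLEN)            # x and s are the same sense
most_dissimilar = acceptability(MAXLEN, MAXLEN)  # x and s are maximally distant
near_miss_a1 = acceptability(2, MAXLEN, alpha=1.0)  # hypothetical path length of 2
near_miss_a3 = acceptability(2, MAXLEN, alpha=3.0)  # same error, larger alpha
```

An identical sense scores 1 and the most dissimilar sense scores 0 regardless of alpha, while the score of the hypothetical near miss shrinks as alpha grows, approaching the binary judgment.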




Figure 2.9: A fragment of the EDR thesaurus including senses related to the Japanese verb 
tsukau 



2.3.4 Further considerations 

This section discusses several further issues which should be noted for the evaluation of word 
sense disambiguation systems. 

The first issue is that the granularity of word senses differs depending on the viewpoint. This
problem is closely related to the problem that different viewpoints result in different distinctions
of word senses (as discussed in Section 2.3.2). Consequently, the performance of the system is
likely to be degraded when finer-grained sense definitions are used for the evaluation. Conversely,
the system would generally perform better for applications that require coarser-grained sense
definitions. The best way to minimize this bias is to use multiple word sense definition paradigms,
with different levels of granularity. Lin [82] used multiple granularity levels by gradually relaxing
sense distinction based on the taxonomy of WordNet [93]. To put it plainly, one can simulate
coarse-grained sense definition by regarding distinct senses dominated by the same parent node
(synset) in the taxonomy as the same sense. The experimental results showed that Lin's method
outperformed the lower bound method as the sense distinction was relaxed.
However, at the same time, multiple evaluation criteria can complicate the interpretation of the
performance.
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Second, one may notice that the most straightforward evaluation type is "task-based eval- 
uation", where the criterion is the degree to which the system improved on the performance 
of another application. However, a surprisingly small number of cases of task-based evaluation
can be found in past word sense disambiguation research. Brown et al. [10] improved on the
performance of French-English machine translation by introducing their word sense disambiguation
method. Fukumoto and Suzuki [^] reported that application of word sense disambiguation
improved on the performance of text categorization^^. Voorhees [147] used word sense disambiguation
(with respect to WordNet senses) to enhance query terms for information retrieval
(IR), which unfortunately did not improve the performance of a conventional IR system. One
reason for this result would be the rudimentary nature of existing word sense disambiguation
methods, especially for fine-grained sense definitions. This poses the question as to what degree
word sense disambiguation should be correctly performed for operational applications. Experiments
conducted by Sanderson [128], aimed at application to IR systems, provide a pointer
to the answer to this question. Sanderson carried out experiments as follows. First, due to
the lack of large-scale sense-annotated data, artificial polysemous words, namely "pseudo-words"
[130, 155], are created, based on words appearing in a given document collection. For
example, distinct words like guerrilla and reptile are considered to be two distinct senses of a
pseudo-word. In other words, pseudo-words simulate extremely coarse-grained lexical ambiguity.
Based on this, one can artificially alter the performance of (virtual) word sense disambiguation
systems. For example, a system with 50% accuracy can be simulated by knowingly selecting
incorrect senses for half of the pseudo-words contained in the document collection. The results
show that (a) a theoretically perfect WSD system improved on the baseline performance of the
IR system (for which polysemy of pseudo-words remains unresolved), and (b) a WSD system with
a 90% accuracy rate did not improve on the baseline performance, which means, at a bare minimum,
more than 90% accuracy is required to enhance IR systems. At the same time, we note
that the applicability to practical issues of this conclusion, derived from the artificial evaluation,
is arguable.
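The idea of simulating a disambiguation system of a given accuracy on pseudo-words can be sketched as follows. This is only an illustration of the principle described above, not Sanderson's actual experimental code: the function name, the fixed random seed and the way in which errors are distributed over occurrences are our own assumptions.

```python
import random


def simulate_wsd(gold_senses, accuracy, senses, seed=0):
    """Return simulated system outputs that are correct for a given fraction of inputs.

    gold_senses: the original word behind each pseudo-word occurrence.
    accuracy:    target accuracy of the simulated system (between 0 and 1).
    senses:      the words merged into the pseudo-word (at least two).
    """
    rng = random.Random(seed)
    n_wrong = round(len(gold_senses) * (1 - accuracy))
    # Knowingly select an incorrect sense at randomly chosen positions.
    wrong_positions = set(rng.sample(range(len(gold_senses)), n_wrong))
    output = []
    for i, gold in enumerate(gold_senses):
        if i in wrong_positions:
            output.append(rng.choice([s for s in senses if s != gold]))
        else:
            output.append(gold)
    return output


# Ten occurrences of the pseudo-word formed from guerrilla and reptile.
gold = ["guerrilla", "reptile"] * 5
simulated = simulate_wsd(gold, accuracy=0.5, senses=["guerrilla", "reptile"])
n_correct = sum(o == g for o, g in zip(simulated, gold))
```

With ten occurrences and a target accuracy of 50%, exactly five outputs are replaced by the other member of the pseudo-word, so `n_correct` is 5. The simulated outputs would then be passed to the IR system in place of the ambiguous pseudo-words.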



2.4 Related NLP Research 

Let us discuss NLP research which is expected to enhance word sense disambiguation. 



^^Schütze [129] improved the quality of information retrieval through word sense discrimination, which differs
from word sense disambiguation in that the former does not use predefined sense candidates. 
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Morphological analysis As reviewed in Section 2.2.1, past word sense disambiguation methods
generally rely on the morphological content of words in the input (or training data), such
as the root form or part-of-speech. In addition, in the case of agglutinative languages, such as
Chinese and Korean, lexical segmentation is also pertinent^^. Note that numerous types of morphological
analyzers (in various languages) with reasonable performance have been established
for easy access.



Syntactic analysis Syntactic analysis can enhance the performance of word sense disambiguation
methods which rely on syntactic relations (associated with polysemous words). This
analysis is especially pertinent when the input comprises a complex sentence. Note that given the
fact that full parsing is still expensive, a number of methods for partial syntactic analysis have
also been applied to identify syntactic relations associated with polysemous words [ll^, 15S].
For Japanese verb sense disambiguation, Fujii et al. [3£] used two simple heuristics for extracting
verb complements^^, rather than syntactic analysis on the whole input sentence. Note that verb
complements are useful features for verb sense disambiguation. The heuristics utilized are given
below:

• each complement is associated with the verb of highest proximity,

• complements containing the genitive case marker no are not considered because they can
constitute either possessive or nominative case markers, and are thus confusing.

They reported that the performance of verb sense disambiguation combined with these heuristics 
is comparable to that combined with a full syntactic analysis. In other words, the overhead 
required for syntactic analysis can be reduced without degrading the system performance. 
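The two heuristics can be sketched as follows. The token representation, the function name and the reading of "highest proximity" as the smallest distance in token positions are our own assumptions; the sketch is not the implementation used by Fujii et al., and the romanized example is purely illustrative.

```python
def attach_complements(tokens):
    """Attach each complement to its nearest verb, skipping genitive complements.

    tokens: sentence-ordered tuples, either ("verb", lemma) or
            ("comp", case_filler, case_marker)  (hypothetical format).
    Returns a dict mapping each verb lemma to its (case_filler, case_marker) pairs.
    Verbs sharing a lemma would be merged here; a fuller version would key on position.
    """
    verb_positions = [i for i, t in enumerate(tokens) if t[0] == "verb"]
    attached = {i: [] for i in verb_positions}
    for i, token in enumerate(tokens):
        if token[0] != "comp" or not verb_positions:
            continue
        _, filler, marker = token
        if marker == "no":
            # Heuristic 2: complements with the genitive marker are not considered.
            continue
        # Heuristic 1: choose the verb at the smallest positional distance.
        nearest = min(verb_positions, key=lambda v: abs(v - i))
        attached[nearest].append((filler, marker))
    return {tokens[v][1]: comps for v, comps in attached.items()}


# "kare no kane o tsukau" (illustrative input)
result = attach_complements([
    ("comp", "kare", "no"),
    ("comp", "kane", "o"),
    ("verb", "tsukau"),
])
```

For this input the complement marked with no is discarded and only the pair ("kane", "o") is attached to tsukau.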



Identification of idiomatic expressions Idiomatic (fixed or frozen) expressions, in which
a specific word collocation pattern stands for a certain meaning, are one obstacle to word sense
disambiguation. For example, similarity-based systems generally fail to interpret inputs where
polysemous words comprise elements of idiomatic expressions, because of semantic overgeneralization
through the use of a thesaurus. Possible solutions would include one proposed by Uramoto [142],
in which idiomatic expressions are described separately in the database so that the system can
control their overgeneralization. At the same time, given the fact that there is no universal 



^^Continuous speech inputs pose the same problem for non-agglutinative languages.
^^In Japanese, a verb complement consists of a noun phrase (case filler) and a postposition (case marker suffix).
See Section B.2 for details.
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consensus as to idiomatic expression types, automatic methods to identify idiomatic expressions
[20, 134] may also prove useful.



Discourse analysis The discourse or domain dependency of each word sense, as has been used
in a small proportion of past word sense disambiguation techniques [47, 101, 157], can provide
strong external constraints. Given a single discourse segment, polysemous words tend to coincide
in interpretation. Anaphora/ellipsis resolution [^, 65, 97] is also expected to enhance the
contextual information. Let us take the following example, in which the pronoun it refers to
taxi in the previous sentence:



(4) A taxi is coming. Let's take it. 

One may notice that correct anaphora resolution would make it easier to disambiguate the 
inherent polysemy of take in the second sentence. 



Establishment of linguistic resources As Kilgarriff [^] identified, lexicography is both a
benefactor and beneficiary for word sense disambiguation. That is, dictionaries provide sense
candidates for polysemous words and clue words for each sense. In addition, automatic identification
of word senses [41, 144, 158] (perhaps combined with lexicography) is expected to avoid
human bias in sense distinction.

Establishment of thesauri is also useful for a number of thesaurus-driven word sense disambiguation
methods [^, 124, 125, p.42, 154]. Conventional methods for automatic
thesaurus construction have utilized dictionary definitions to extract IS-A or hyper/hypo relations
[^, 92, 100]. However, given the number of electronic thesauri currently available, the
recent trend seems to be to focus on enhancement of existing thesauri (not development from
scratch). Besides this, given the fact that most thesauri aim at general purpose applications,
they need to be adjusted for a particular usage 0, g. Tokunaga et al. [138] and Uramoto [143]
(independently) extended existing thesauri by appropriately positioning unregistered words in
the taxonomy. Hearst and Schütze [52] proposed a method to adjust the general taxonomy
defined in WordNet [93] for particular usages (although they targeted the application toward
text categorization systems).

Finally, the establishment of a sense-annotated corpus as a benchmark collection is indispensable
for the standardized evaluation of word sense disambiguation methods [g4, 9^].
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2.5 Discussion 



Pustejovsky and Boguraev [115] identified one limitation of current word sense disambiguation
methods: they depend on a fixed number of predefined word senses, and therefore
a novel word usage cannot be considered. This limitation can be especially problematic when
one targets an all-purpose system, because exhaustive sense enumeration seems practically impossible.
However, focusing on a particular domain (sublanguage), where vocabulary size is
relatively limited, this problem is expected to be resolvable. One may argue that compilation
of word senses for different domains poses a considerable overhead. However, automatic word
sense identification [41, 144, 158] is expected to reduce this overhead.

In addition, Basili et al. [^, ^] proposed a method to tune word senses taken from a general
purpose lexicon (in their case, WordNet [93]) to a specific sublanguage. They first (manually)
identified some of the higher level word classes in WordNet [93] as "kernel senses" (25 for nouns
and 15 for verbs), for each of which representative verbs are automatically identified. Then,
a sublanguage corpus is used to train statistical classifiers for each kernel sense^^. Finally,
statistical classifiers evaluate the membership of each (prospective) polysemous word to the
kernel senses. Consequently, novel senses are acquired and irrelevant senses specified in the
sublanguage are discarded.

Thesaurus extending methods [138, 143] can also potentially resolve the novel sense problem. 
Tokunaga et al. [138] used the "SVMV model" [^] to train probabilistic classifiers of thesaurus
nodes (they use the Bunruigoihyo thesaurus [102] as the core thesaurus). Each node is represented
by the co-occurrence vector associated with words belonging to the node. Given a new
word unlisted in the thesaurus (which is also represented by a vector), classifiers compute the
probability that the word belongs to each node, and the word is positioned in the node with
maximum probability. One may notice that this technique is equal to positioning a novel usage
of a word in a conceptual hierarchy (such as WordNet [93] or EDR [58])^^.

At the same time, we concede that this issue still needs to be further explored, and do not 
pretend to draw any premature conclusions in this research. 
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^^Basili et al. used a statistical method proposed by Yarowsky [154] for the classifier training.



^^One may argue that, strictly speaking, the number of word classes (senses) defined in a thesaurus is already 
limited, that is, novel senses undefined in thesauri cannot be considered. The answer to this problem remains as 
an open question. 
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2.6 Summary 

In this chapter, we surveyed past research associated with word sense disambiguation. First,
we elaborated on different classes of corpus-based word sense disambiguation methods, which
consist of qualitative approaches (selectional restrictions, decision trees and decision lists) and
quantitative approaches (probabilistic models and statistically/thesaurus-driven similarity-based
methods). We identified that one critical issue is the selection of useful features (clues) for word
sense disambiguation. In addition, the performance of similarity-based methods strongly depends
on the way similarity between given examples is computed. We also identified that past methods
generally use, as source knowledge, a corpus in which each polysemous word is associated with its
correct sense. Past experiments showed that large-scale corpora are required for operational
systems. In other words, a considerable overhead is required when one tries to supervise, i.e.
manually annotate word senses in, large corpora. In view of this problem, we second investigated
how past research has tried to minimize the overhead required for human supervision.
While past methods reportedly minimized (or completely excluded) the overhead
for supervision, we identified that the applicability of past methods seems to be limited to certain
specific usages, such as machine translation oriented systems [^8], coarse-grained word sense
distinction [63, 130, 157], and disambiguation of adjective senses [60]. Third, from a scientific point
of view, we described evaluation methodologies for word sense disambiguation, and found
that so far there has been no standardized evaluation criterion. This issue needs to be further
explored. Fourth, we described a number of areas of NLP research which are expected to enhance word
sense disambiguation. Finally, we discussed a limitation of current word sense disambiguation
and a possible solution for this problem.

In the following chapters, we will tackle some of the problems identified above. Chapter 3
describes the overall design of our verb sense disambiguation system, in which we propose a
method to weigh the degree to which each feature contributes to disambiguation. A subsequent
chapter proposes novel methods to minimize the overhead required by our system, i.e. the
overhead for supervision on large corpora and the overhead for searching large corpora. Finally,
a further chapter explores a similarity computation that integrates a hand-crafted thesaurus and
statistical information, which is expected to enhance similarity-based word sense disambiguation.

Chapter 3



Verb Sense Disambiguation System 



This chapter describes a similarity-based verb sense disambiguation system.
The disambiguation process follows the nearest neighbor method. Given an input
consisting of a polysemous verb and the case complements (or cases) it governs, the
system searches a database for the example most similar to the input. The polysemy
of the verb is then resolved by superimposing the sense of the verb associated with
the retrieved example. The system uses either a hand-crafted thesaurus or
co-occurrence statistics for the similarity computation. The main interest of this
chapter lies in the introduction of the notion of 'case contribution to
disambiguation' (CCD), which is a weighting schema for each case in the similarity
computation. Intuitively speaking, the more semantically diverse the case filler
examples of a case are across verb senses, the more that case contributes to verb
sense disambiguation. We also report the results of comparative experiments, in
which disambiguation performance is improved by considering the CCD factor.



3.1 Overall Design 

This chapter elaborates on our proposed similarity-based verb sense disambiguation system¹
and its evaluation through experimentation. Although our system is currently implemented
for Japanese word sense disambiguation, the methodology can theoretically be applied to other
languages. Let us briefly explain the disambiguation process based on Figure 3.1. In this figure,
"input" denotes a (complex/simple) sentence containing polysemous verb(s). The "morphological
and syntactic analyzer" then extracts predicate-argument structures ("pred-args")² from the
input. It should be noted that in the case of Japanese, morphological analysis entails lexical
segmentation as well as part-of-speech tagging. Thereafter, given the pred-args, the core
of the verb sense disambiguation system ("VSD") outputs plausible sense(s) for each verb in
the pred-args. In this process, the VSD uses supervised examples (in the "database") and
"thesaurus"/"co-occurrence" data. The VSD additionally outputs the degree of certainty of its
decision ("certainty"). Currently, we rely on existing NLP tools for the morph/syntax analyzer,
and this component of the system is beyond the scope of this research. In fact, for Japanese, a
number of existing tools such as JUMAN (a morphological analyzer) [^8| and QJP (a morphological
and syntactic analyzer) |^2[ have shown promising performance. Besides these tools, a number
of simple heuristics to identify predicate-argument structures have been proposed for verb sense
disambiguation |19, |3^, in an attempt to minimize the overhead for syntactic analysis. To sum
up, the focus of this chapter is the portion of the overall system enclosed within the dashed
region of Figure 3.1. Hereafter, "system" refers to the VSD, and we shall use the term "inputs"
to refer to inputs which have been morphologically and syntactically analyzed (in other words,
predicate-argument structures).

Section 3.2 overviews the basis of our verb sense disambiguation system, and Section 3.3 then
elaborates on the disambiguation methodology. Section 3.4 evaluates our system by way of
experiments. A subsequent section describes a way to further enhance our system, including the
computation of the certainty degree.

¹ We originally used the term "example-based" method jsi]. However, we shall use the more
general "similarity-based" terminology in this research, because no explanatory distinction
between similar approaches, such as case-based and exemplar-based reasoning, has been identified.



[Figure 3.1: flow diagram. input → morph/syntax analyzer → pred-args → VSD → verb sense/certainty;
the VSD consults the database and the thesaurus/co-occurrence data.]

Figure 3.1: The overall design of the verb sense disambiguation system

² We shall interchangeably use the terms "predicate-argument" and "verb-complement" structures.

3.2 Basic Idea



The core mechanism in our verb sense disambiguation system is based on the method proposed
by Kurohashi and Nagao [77] and later enhanced by Fujii et al. [37, 38]. The system uses
an example-database (database, hereafter) containing examples of collocations for each verb
sense and its associated case frame(s). It should be noted that while the term "case" has been
used variously by different researchers, covering both surface-level and deeper-level senses, we
shall consistently use "case" with reference to the surface level³. Figure 3.2 shows a fragment
of the entry associated with the Japanese verb toru. The verb toru has multiple senses, a sample
of which are "to take/steal", "to attain", "to subscribe" and "to reserve". The database specifies
the case frame(s) associated with each verb sense. In Japanese, a complement of a verb consists
of a noun phrase (case filler) and its case marker suffix, for example ga (nominative) or wo
(accusative). The database also lists several case filler examples for each case. In practice, the
database is usually compiled from machine readable dictionaries (MRDs) and text corpora. The
task of the system is "to interpret" the verbs occurring in the input text, i.e. to choose one
sense from among a set of candidates⁴.



toru:

  nominative (ga)             accusative (wo)             verb sense
  --------------------------  --------------------------  --------------------
  suri (pickpocket)           kane (money)                toru (to take/steal)
  kanojo (she)                saifu (wallet)
  ani (brother)               otoko (man)
                              uma (horse)
                              aidea (idea)

  kare (he)                   menkyoshou (license)        toru (to attain)
  kanojo (she)                shikaku (qualification)
  gakusei (student)           biza (visa)

  kare (he)                   shinbun (newspaper)         toru (to subscribe)
  chichi (father)             zasshi (journal)
  kyaku (client)

  kare (he)                   kippu (ticket)              toru (to reserve)
  dantai (group)              heya (room)
  ryokoukyaku (passenger)     hikouki (airplane)
  joshu (assistant)

Figure 3.2: A fragment of the database, and the entry associated with the Japanese verb toru
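
As a minimal sketch of how such a database entry might be held in memory, the following Python
fragment encodes the toru fragment of Figure 3.2 as nested dictionaries (verb, then sense, then
case marker, then example case fillers). The names DATABASE and senses_listing are hypothetical and
serve only this illustration; they are not part of the system described here.

```python
# Hypothetical in-memory form of the toru entry in Figure 3.2.
# Each sense maps a case marker to its list of example case fillers.
DATABASE = {
    "toru": {
        "to take/steal": {
            "ga": ["suri", "kanojo", "ani"],
            "wo": ["kane", "saifu", "otoko", "uma", "aidea"],
        },
        "to attain": {
            "ga": ["kare", "kanojo", "gakusei"],
            "wo": ["menkyoshou", "shikaku", "biza"],
        },
        "to subscribe": {
            "ga": ["kare", "chichi", "kyaku"],
            "wo": ["shinbun", "zasshi"],
        },
        "to reserve": {
            "ga": ["kare", "dantai", "ryokoukyaku", "joshu"],
            "wo": ["kippu", "heya", "hikouki"],
        },
    }
}


def senses_listing(verb, case, noun):
    """Return the senses of `verb` whose example set for `case` contains `noun`."""
    return [sense for sense, frame in DATABASE[verb].items()
            if noun in frame.get(case, [])]
```

A lookup such as senses_listing("toru", "ga", "kare") returns several senses at once, which
already hints that the nominative alone is a weak clue for this verb.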



All verb senses we use are defined in the machine readable dictionary "IPAL" [56] (as in the
work of Kurohashi and Nagao [77]). Let us devote a little space to explaining IPAL.
IPAL lists about 900 Japanese verbs and categorizes each verb into multiple "subentries", based
on verbal syntax and semantics. In our case, subentries are equivalent to verb senses. Figure 3.3
shows a fragment of a subentry for the Japanese verb kakeru, including the subentry number,
definition of the sense, case pattern, semantic features for case fillers, and example case filler
nouns. The "case pattern" entry defines the obligatory cases associated with the verb sense, in
which $N_i$ represents the $i$-th case filler⁵. In addition, optional cases are indicated in
parentheses (such as N2-ni in Figure 3.3). IPAL defines 19 semantic feature types, which comprise
a hierarchy, as shown in Figure 3.4. These semantic features can therefore be used as selectional
restrictions (see Section 2.2.1). However, according to experiments conducted by Kurohashi and
Nagao [77], the IPAL semantic features are not sufficiently fine-grained for rule-based verb sense
disambiguation. Kurohashi and Nagao thus implemented their similarity-based method based
on the examples listed in each "example nouns" entry, and showed its effectiveness over the
rule-based method. Although Kurohashi and Nagao used only the examples listed in IPAL, their
method is open to the use of additional examples. In our case, we take from IPAL only the
"case pattern" and "example nouns" entries to initialize the database, and later enhance the
database with additional supervised examples.

³ Conventional case systems for natural language are reviewed, for example, by Bruce.

⁴ Note that unlike the automatic acquisition of word sense definitions [115, 144, 158], the task of
the system is to identify the best matched category with a given input, from predefined candidates.



  subentry #     008
  definition     attach something to a body
  case pattern   N1-ga (N2-ni) N3-wo

  case   semantic feature   example nouns
  ----   ----------------   ----------------------------------------------------
  N1     human              kare (he), kanojo (she)
  N2     parts              kata (shoulder), te (hand), kubi (neck)
  N3     products           epuron (apron), megane (glasses), pendanto (pendant)

Figure 3.3: An example of a subentry for the Japanese verb kakeru



[Figure 3.4: tree diagram of the IPAL semantic features. Node labels: animal, organization, human,
concrete, plant, parts, natural, products, diverse, phenomenon, action, mental, linguistic products,
abstract, character, relation, location, time, quantity.]

Figure 3.4: The hierarchy of IPAL semantic features

⁵ Note that IPAL describes only the typical case order, and that word order in Japanese is less
strictly restricted than in other languages such as English.

Let us now turn to the process of verb sense disambiguation. Given an input, which consists
of a polysemous verb and the complements it governs, the system identifies the verb sense on the
basis of the scored similarity between the input and the examples given for each verb sense. Let
us look at Figure 3.2 again and take the sentence below as an example input:

(5) hisho ga shindaisha wo toru.
    (secretary-NOM) (sleeping car-ACC) (?)



In this example, one may consider hisho ("secretary") and shindaisha ("sleeping car") to be
semantically similar to joshu ("assistant") and hikouki ("airplane") respectively, and since both
collocate with the "to reserve" sense of toru, one could infer that toru should be interpreted as
"to reserve". This resolution can be called the nearest neighbor method because the verb in
the input is disambiguated by superimposing the sense of the verb appearing in the example of
highest similarity⁶. As one can see, the similarity between an input and an example is estimated
based on the similarity between case fillers marked with the same case. While the notion of
nearest neighbor does not predefine the type of similarity measurement used, we will explain
two different types of similarity measurement in the following sections.

Furthermore, since the restrictions imposed by the case fillers in choosing the verb sense are
not equally selective, we introduce a weight, the 'case contribution to disambiguation' (CCD),
for each case of a verb. Let us consider another example input:

(6) gakusei ga shuukanshi wo toru.
    (student-NOM) (magazine-ACC) (?)

The nominative, gakusei ("student"), in sentence (6) is found in the "to attain" case frame
of toru and in the case frame of no other sense of toru. Therefore, the nominative supports
the interpretation "to attain". On the other hand, the accusative, shuukanshi ("magazine"), is
most similar to the examples included in the accusative of "to subscribe" and therefore the
accusative supports another interpretation, "to subscribe". Although the most plausible
interpretation here is actually the latter, the former would be chosen if one always relied
equally on the similarity in the nominative and the accusative. However, in the case of toru,
since the semantic range of nouns collocating with the verb in the nominative does not seem to be
clearly delineated across verb senses (in Figure 3.2, the nominative of each verb sense displays
the same general concept, i.e. HUMAN), it would be difficult, or even risky, to interpret the verb
sense based on the similarity in the nominative. In contrast, since the semantic ranges are
disparate in the accusative, it would be feasible to rely more strongly on the similarity here.

⁶ Hereafter, "similarity-based" systems basically refers to those which are based on the nearest
neighbor method.



This argument can be illustrated as in Figure 3.5, in which the symbols $e_1$ and $e_2$ denote
example case fillers of different case frames, and an input sentence includes two case fillers
denoted by $x$ and $y$. The figure shows the distribution of example case fillers for the
respective case frames in a semantic space. The semantic similarity between two given case fillers
is represented by the physical distance between the two symbols. In the nominative, since $x$
happens to be much closer to an $e_2$ than to any $e_1$, $x$ may be estimated to belong to the
range of $e_2$'s, although $x$ actually belongs to both sets of $e_1$'s and $e_2$'s. In the
accusative, however, $y$ would be properly estimated to belong to the set of $e_1$'s because the
two accusative case filler sets are disjoint, even though the examples do not fully cover each of
the ranges of $e_1$'s and $e_2$'s. Note that this difference would be critical if example data
were sparse. One may argue that this property can be generalized into the notion that the system
should always rely only on the similarity in the accusative for verb sense disambiguation.
Although this holds for some typical verbs, it is not guaranteed for every verb. Our approach,
which computes the degree of contribution for each verb individually, can handle exceptional
cases as well as typical ones. We will explain the method used to compute CCD in Section 3.3.

Figure 3.5: The semantic ranges of the nominative and accusative for the verb toru

3.3 Methodology

To illustrate the overall algorithm, we will consider an abstract specification of both an input
and the database (see Figure 3.6). Let the input be
$\{n_{c_1}\text{-}m_{c_1}, n_{c_2}\text{-}m_{c_2}, n_{c_3}\text{-}m_{c_3}, v\}$, where $n_{c_i}$
denotes the case filler for the case $c_i$, and $m_{c_i}$ denotes the case marker for $c_i$, and
assume that the interpretation candidates for $v$ are derived from the database as $s_1$, $s_2$
and $s_3$. The database also contains a set $\mathcal{E}_{s_i,c_j}$ of case filler examples for
each case $c_j$ of each sense $s_i$ ("—" indicates that the corresponding case is not allowed).

[Figure 3.6: schematic aligning the input (n_c1-m_c1, n_c2-m_c2, n_c3-m_c3, v(?)) with the database
rows of example sets for v(s1), v(s2) and v(s3).]

Figure 3.6: An input and the database



During the verb sense disambiguation process, the system first discards those candidates
whose case frame does not fit the input. In the case of Figure 3.6, $s_3$ is discarded because the
case frame of $v(s_3)$ does not subcategorize for the case $c_1$⁷. In contrast, $s_2$ will not be
rejected at this step. This is based on the fact that in Japanese, cases can be easily omitted if
they are inferable from the given context. Note that, to ensure reasonable system coverage, no
omission of case fillers is allowed in the database. In the case of IPAL, example case fillers
cover every slot.

In the next step, the system computes the score of each remaining candidate and chooses the one
with the highest score as the most plausible interpretation. The score of an interpretation
is computed as the weighted average of the similarity degrees of the input case fillers
with respect to each of the example case fillers (in the corresponding case) listed in the database
for the sense under evaluation⁸. Formally, this is expressed by Equation 3.1, where
$\mathrm{Score}(s)$ is the score of sense $s$ of the input verb, and
$\mathrm{SIM}(n_c, \mathcal{E}_{s,c})$ is the maximum similarity degree between the input case
filler $n_c$ and the corresponding case fillers in the database example set $\mathcal{E}_{s,c}$
(calculated through Equation 3.2). $\mathrm{CCD}(c)$ is the weight factor of case $c$, which we
will explain later in this section.

\[ \mathrm{Score}(s) = \frac{\sum_c \mathrm{SIM}(n_c, \mathcal{E}_{s,c}) \cdot \mathrm{CCD}(c)}{\sum_c \mathrm{CCD}(c)} \tag{3.1} \]

\[ \mathrm{SIM}(n_c, \mathcal{E}_{s,c}) = \max_{e \in \mathcal{E}_{s,c}} \mathrm{sim}(n_c, e) \tag{3.2} \]

⁷ Since IPAL does not necessarily enumerate all the possible optional cases, the absence of case
$c_1$ from $v(s_3)$ in the figure may denote that $c_1$ is optional. If so, the interpretation
$s_3$ should not be discarded in this stage. To avoid this problem, we use the same technique as
used in Kurohashi's method. That is, we define several particular cases beforehand, such as the
nominative, the accusative and the dative, to be obligatory, and impose the grammatical case frame
constraints above only for these obligatory cases. Optionality of case needs to be further explored.

⁸ $\mathcal{E}_{s_2,c_4}$ is not taken into consideration in the computation since $c_4$ does not
appear in the input.



It should be noted that while Equation 3.1 proposes one implementation of the score computation,
there can be a number of variations. For example, an alternative model could exclude cases with
small CCD values (for the verb under evaluation) from this computation. In fact, the introduction
of the CCD factor can be characterized as one type of "feature selection"⁹: cases associated with
greater CCD values constitute a useful feature set. In the extreme case, one can rely solely on
the case with the greatest CCD value¹⁰. Optimization of the model remains an open question and
needs further exploration.
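
The two steps of the procedure (case frame filtering, then scoring by a CCD-weighted average of
the best per-case similarity) can be sketched in Python as follows. This is a minimal illustration
under stated assumptions rather than the implementation used in this research: the function names,
the dictionary layout, and the decision to return None when no candidate survives are all
hypothetical, and the noun similarity `sim` and the weights `ccd` are supplied by the caller.

```python
def sim_max(filler, examples, sim):
    """SIM: the best similarity between the input filler and any example filler."""
    return max(sim(filler, e) for e in examples)


def score(input_fillers, sense_frame, ccd, sim):
    """CCD-weighted average of SIM over the cases present in the input.

    input_fillers: {case: noun} for the input
    sense_frame:   {case: [example nouns]}; must contain every input case
    ccd:           {case: weight}
    """
    total_weight = sum(ccd[c] for c in input_fillers)
    if total_weight == 0:
        return 0.0
    weighted = sum(sim_max(n, sense_frame[c], sim) * ccd[c]
                   for c, n in input_fillers.items())
    return weighted / total_weight


def disambiguate(input_fillers, senses, ccd, sim):
    """Discard senses whose frame lacks an input case, then pick the top score."""
    candidates = {s: f for s, f in senses.items()
                  if all(c in f for c in input_fillers)}
    if not candidates:
        return None  # assumption of this sketch: no fallback is attempted here
    return max(candidates,
               key=lambda s: score(input_fillers, candidates[s], ccd, sim))
```

Cases that a sense allows but the input omits are simply ignored by `score`, which mirrors the
treatment of omitted cases described above.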

The following two sections detail the computation of similarity between case fillers, and the 
CCD factor. 



3.3.1 Computation of similarity between case filler nouns 

With regard to the computation of the similarity between two different case fillers
($\mathrm{sim}(n_c, e)$ in Equation 3.2), we experimentally used two alternative approaches. The
first approach uses semantic resources, that is, hand-crafted thesauri (such as Roget's thesaurus
[14] or WordNet in the case of English, and Bunruigoihyo [102], EDR (S^ or Goi-Taikei [55]¹¹ in
the case of Japanese), based on the intuitively feasible assumption that words located near each
other within the structure of a thesaurus have similar meaning. Therefore, the similarity between
two given words is represented by the length of the path between them in the thesaurus structure
[38, 77, 81, 142]¹². We used the similarity function empirically identified by Kurohashi and
Nagao [77], in which the relation between the length of the path in the Bunruigoihyo thesaurus
and the similarity is defined as shown in Table 3.1. In this thesaurus, each entry is assigned a
7 digit class code. In other words, this thesaurus can be considered as a tree, 7 levels in depth,
with each leaf as a set of words. Figure 3.7 shows a fragment of the Bunruigoihyo thesaurus
including some of the nouns in both Figure 3.2 and the input sentence above.
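
To make the path-length idea concrete, the following Python sketch derives a path length from two
class codes and then looks the length up in a caller-supplied table. It rests on an assumption
that is not stated in the text: that each digit of a class code corresponds to exactly one edge of
the tree, so that two codes sharing their first k digits are 2 × (depth − k) edges apart. The
function names and the example codes are hypothetical, and the length-to-similarity table is
passed in as a parameter rather than hard-coded.

```python
def path_length(code_a, code_b, depth=7):
    """Path length between two leaves, given their fixed-length class codes.

    Assumption of this sketch: one code digit per tree edge, so codes that
    share their first k digits meet k levels below the root.
    """
    if len(code_a) != depth or len(code_b) != depth:
        raise ValueError("class codes must have exactly `depth` digits")
    shared = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        shared += 1
    return 2 * (depth - shared)


def thesaurus_sim(code_a, code_b, len_to_sim, default=0):
    """Map the path length to a similarity via a table such as Table 3.1."""
    return len_to_sim.get(path_length(code_a, code_b), default)
```

Path lengths that the table does not list fall back to `default`, so the caller decides how
unrelated words are scored.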

[Figure 3.7: tree diagram of a thesaurus fragment. Leaf nouns: kare (he), kanojo (she), otoko (man),
joshu (assistant), hisho (secretary), kane (money), heya (room), kippu (ticket), uma (horse).]

Figure 3.7: A fragment of the Bunruigoihyo thesaurus

Table 3.1: The relation between the length of the path between two nouns n1 and n2 in the
Bunruigoihyo thesaurus (len(n1, n2)), and their relative similarity (sim(n1, n2))

  len(n1, n2)    2    4    6    8   10   12
  sim(n1, n2)   11   10    9    8    7    5

The second approach is based on statistical modeling. We adopted one typical implementation
called the "vector space model" (VSM) [106, 127, 130], which has a long history of application
in information retrieval (IR) and text categorization (TC) tasks. In the case of IR/TC, VSM is
used to compute the similarity between documents, each of which is represented by a vector
comprising statistical factors of the content words in the document. Similarly, in our case, each
noun is represented by a vector comprising statistical factors, although the statistical factors
are calculated in terms of the predicate-argument structures in which each noun appears.
Predicate-argument structures, which consist of complements (case filler nouns and case markers)
and verbs, have also been used in the task of noun classification |5S]. This can be expressed by
Equation 3.3, where $\vec{n}$ is the vector for the noun $n$ in question, and the items $t_i$
represent the statistics for predicate-argument structures including $n$.

\[ \vec{n} = \langle t_1, t_2, \ldots, t_i, \ldots \rangle \tag{3.3} \]

⁹ See the item "Probabilistic models" (p. 14) in Section 2.2.1 for past feature selection methods.

¹⁰ Yarowsky [157], for example, advocated a simple implementation using only the single most useful
feature.

¹¹ Goi-Taikei is a relatively newly released Japanese dictionary.

¹² Different types of application of hand-crafted thesauri to word sense disambiguation have been
proposed, for example, by Yarowsky [154] (see Section 2.2.2).



In regard to $t_i$, we used the notion of TF-IDF [127]. TF (term frequency) gives each context (a
case marker/verb pair) importance proportional to the number of times it occurs with a given
noun. The rationale behind IDF (inverse document frequency) is that contexts which rarely
occur over collections of nouns are valuable, and that therefore, the IDF of a context is inversely
proportional to the number of noun types that appear in that context. This notion is expressed
by Equation 3.4, where $f(\langle n,c,v \rangle)$ is the frequency of the tuple
$\langle n,c,v \rangle$, $nf(\langle c,v \rangle)$ is the number of noun types which collocate with
verb $v$ in the case $c$, and $N$ is the number of noun types within the overall co-occurrence data.

\[ t_i = f(\langle n,c,v \rangle) \cdot \log \frac{N}{nf(\langle c,v \rangle)} \tag{3.4} \]

We then compute the similarity between nouns $n_1$ and $n_2$ by the cosine of the angle between
the two vectors $\vec{n_1}$ and $\vec{n_2}$. This is realized by Equation 3.5.

\[ \mathrm{sim}(n_1, n_2) = \frac{\vec{n_1} \cdot \vec{n_2}}{|\vec{n_1}|\,|\vec{n_2}|} \tag{3.5} \]

We extracted co-occurrence data from the RWC text base RWC-DB-TEXT-95-1 [119]. This
text base consists of four years' worth of "Mainichi Shimbun" newspaper articles [^], which have
been automatically annotated with morphological tags. It contains about 100 million morphemes
in total. Since full parsing is usually expensive, a simple heuristic rule was used in order
to obtain collocations of nouns, case markers and verbs in the form of tuples <n,c,v>. This
rule systematically associates each sequence of a noun and a case marker with the nearest verb,
and produced 419,132 tuples. This co-occurrence data was used in the preliminary
experiment described in Section 3.4¹³.
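
The vector construction and the cosine measure can be sketched in a few lines of Python. The
sketch below assumes that the co-occurrence data is available as a list of (noun, case, verb)
tuples and stores each noun vector sparsely as a dictionary keyed by (case, verb) context; the
function names are hypothetical, and returning 0.0 for an all-zero vector is a choice made here
for robustness, not something prescribed by the text.

```python
import math
from collections import Counter, defaultdict


def build_vectors(tuples):
    """Build one sparse TF-IDF vector per noun from (noun, case, verb) tuples.

    Each context is a (case, verb) pair, weighted for a noun by
    f(<n,c,v>) * log(N / nf(<c,v>)).
    """
    freq = Counter(tuples)                      # f(<n,c,v>)
    nouns_per_context = defaultdict(set)        # noun types seen with each <c,v>
    for n, c, v in freq:
        nouns_per_context[(c, v)].add(n)
    total_nouns = len({n for n, _, _ in freq})  # N

    vectors = defaultdict(dict)
    for (n, c, v), f in freq.items():
        idf = math.log(total_nouns / len(nouns_per_context[(c, v)]))
        vectors[n][(c, v)] = f * idf
    return dict(vectors)


def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * vec_b.get(k, 0.0) for k, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Nouns that occur in the same set of contexts with the same frequencies receive a cosine of 1,
while nouns with no context in common receive 0.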



3.3.2 Computation of case contribution factor 



In Equation 3.1, $\mathrm{CCD}(c)$ expresses the weight factor of the contribution of case $c$ to
(current) verb sense disambiguation. Intuitively speaking, preference should be given to cases
whose case fillers are classified into more disjoint semantic categories across verb senses. As
such, $c$'s contribution to the sense disambiguation of a given verb, $\mathrm{CCD}(c)$, is likely
to be higher if the example case filler sets $\{\mathcal{E}_{s_i,c} \mid i = 1, \ldots, n\}$ share
fewer elements, as in Equation 3.6.

\[ \mathrm{CCD}(c) = \left( \frac{1}{\binom{n}{2}} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{|\mathcal{E}_{s_i,c}| + |\mathcal{E}_{s_j,c}| - 2\,|\mathcal{E}_{s_i,c} \cap \mathcal{E}_{s_j,c}|}{|\mathcal{E}_{s_i,c}| + |\mathcal{E}_{s_j,c}|} \right)^{\alpha} \tag{3.6} \]

Here, $\alpha$ is a constant for parameterizing the extent to which CCD influences verb sense
disambiguation. The larger $\alpha$, the stronger CCD's influence on the system output. To avoid
data sparseness, we smooth each element (noun example) in $\mathcal{E}_{s_i,c}$. In practice, this
involves generalizing each example noun into a 5 digit class based on the Bunruigoihyo thesaurus,
as has been commonly used for smoothing.
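
A minimal Python sketch of such a weight is given below. It assumes one particular reading of
the CCD computation: for every pair of senses, the number of elements not shared by the two
example sets is divided by the combined size of the sets, these ratios are averaged over all
pairs, and the average is raised to the power α. The normalization and the use of α as an
exponent are assumptions of this sketch, as is the function name; the input sets are taken to be
already smoothed into semantic classes.

```python
from itertools import combinations


def ccd(example_sets, alpha=1.0):
    """Case contribution to disambiguation for one case of one verb.

    example_sets: one set of (smoothed) example fillers per verb sense.
    Averages, over all pairs of senses, the fraction of elements that the
    two sets do not share, then raises the average to the power alpha.
    """
    pairs = list(combinations(example_sets, 2))
    if not pairs:
        return 0.0  # a single sense leaves nothing to discriminate
    total = 0.0
    for a, b in pairs:
        size = len(a) + len(b)
        if size:
            total += (size - 2 * len(a & b)) / size
    return (total / len(pairs)) ** alpha
```

Under this reading, a case whose fillers fall into the same class for every sense (such as a
nominative that is always HUMAN) receives 0, and a case with pairwise disjoint filler sets
receives 1.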



¹³ Note that each verb in the co-occurrence data should ideally be annotated with its verb sense.
However, there is no existing Japanese text base with a sufficient volume of word sense tags.

3.4 Experimentation

We estimated the performance of our verb sense disambiguation method through a comparative
experiment with other existing methods, in which we compared the following five methods:

• lower bound (LB), in which the system systematically chooses the most frequently appearing
verb sense in the database [42],

• rule-based method (RB), in which the system uses a thesaurus to (automatically) identify
appropriate semantic classes as selectional restrictions for each verb complement,

• Naive-Bayes method (NB), in which the system interprets a given verb based on the
probability that it takes each verb sense,

• similarity-based method using the vector space model (VSM), in which the system uses
the above mentioned co-occurrence data extracted from the RWC text base,

• similarity-based method using the Bunruigoihyo thesaurus (BGH), in which the system
uses Table 3.1 for the similarity computation.

Note that the last two similarity-based methods consider the CCD factor in the similarity
computation.

In the rule-based method, the selectional restrictions are represented by thesaurus classes,
and allow as verb complements only those nouns dominated by the given class in the thesaurus
structure. In order to identify appropriate thesaurus classes, we used the association measure
proposed by Resnik [120], which computes the information-theoretic association degree between
case fillers and thesaurus classes, for each verb sense¹⁴. Equation 3.7 duplicates the
corresponding equation in Chapter 2 here for the sake of readability.

\[ A(s,c,r) = P(r \mid s,c) \cdot \log \frac{P(r \mid s,c)}{P(r \mid c)} \tag{3.7} \]

Here, $A(s,c,r)$ is the association degree between verb sense $s$ and class $r$ (restriction
candidate) with respect to case $c$. We used the semantic classes defined in the Bunruigoihyo
thesaurus. $P(r \mid s,c)$ is the conditional probability that a case filler example associated
with case $c$ of sense $s$ is dominated by class $r$ in the Bunruigoihyo thesaurus. $P(r \mid c)$
is the conditional probability that a case filler example for case $c$ (disregarding verb sense)
is dominated by class $r$. Each probability is estimated based on the distribution obtained from
the training data. In practice, every $r$ whose association degree is above a certain threshold is
chosen as a selectional restriction [120, 124]. Intuitively speaking, decreasing the value of the
threshold broadens the system coverage while opening the way for irrelevant (noisy) selectional
restrictions.

¹⁴ Note that previous research has applied this technique to tasks other than verb sense
disambiguation, such as syntactic disambiguation [120] and disambiguation of case filler noun
senses [124].
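
The association degree and the threshold-based selection can be illustrated with the following
Python sketch. It assumes that the training data is a list of (sense, case, noun) triples and
that a caller-supplied function `classes_of` returns the set of thesaurus classes dominating a
noun; both the data layout and the function names are hypothetical. Probabilities are estimated
by simple relative frequency, which is one plausible reading of "the distribution obtained from
the training data".

```python
import math


def association(examples, sense, case, cls, classes_of):
    """A(s, c, r) = P(r | s, c) * log(P(r | s, c) / P(r | c)).

    examples:   iterable of (sense, case, noun) training triples
    classes_of: function mapping a noun to the set of classes dominating it
    """
    in_case = [(s, n) for s, c, n in examples if c == case]
    in_sense = [n for s, n in in_case if s == sense]
    if not in_case or not in_sense:
        return 0.0
    p_r_sc = sum(cls in classes_of(n) for n in in_sense) / len(in_sense)
    p_r_c = sum(cls in classes_of(n) for _, n in in_case) / len(in_case)
    if p_r_sc == 0.0:
        return 0.0  # the product vanishes when the class never occurs
    return p_r_sc * math.log(p_r_sc / p_r_c)


def select_restrictions(examples, sense, case, candidates, classes_of, threshold):
    """Keep every candidate class whose association degree exceeds the threshold."""
    return [r for r in candidates
            if association(examples, sense, case, r, classes_of) > threshold]
```

Lowering `threshold` admits more classes, which is the coverage/noise trade-off noted above.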

The Naive-Bayes method, which is one type of Bayesian classification, assumes that each
feature (i.e. case filler) included in a given input is conditionally independent of the other
case fillers: the system approximates the probability that an input $x$ takes a verb sense $s$,
$P(s \mid x)$, simply by computing the product of the probabilities that verb sense $s$ takes
$n_c$ as a case filler for each case $c$. The verb sense with maximal probability is then selected
as the interpretation (Equation 3.8)¹⁵.

\[ \begin{aligned} \arg\max_s P(s \mid x) &= \arg\max_s \frac{P(s) \cdot P(x \mid s)}{P(x)} \\ &= \arg\max_s P(s) \cdot P(x \mid s) \\ &\approx \arg\max_s P(s) \prod_c P(n_c \mid s) \end{aligned} \tag{3.8} \]

Here, $P(n_c \mid s)$ is the probability that a case filler associated with sense $s$ for case $c$
in the training data is $n_c$. We estimated $P(s)$ based on the distribution of the verb senses in
the training data. In practice, because of data sparseness, not all case fillers $n_c$ appear in
the database, and as such, we generalize each $n_c$ into a semantic class defined in the
Bunruigoihyo thesaurus.
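
As an illustration of this decision rule, the following Python sketch trains relative-frequency
estimates and selects the sense maximizing the prior times the product of per-case
probabilities, computed in log space. The data layout (each example is a sense paired with a
{case: semantic class} dictionary), the function names, and in particular the small probability
floor `smoothing` for unseen fillers are assumptions of this sketch; the text does not specify
how zero counts are handled.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (sense, {case: semantic_class}) pairs."""
    sense_counts = Counter(s for s, _ in examples)
    filler_counts = Counter((s, c, k) for s, frame in examples
                            for c, k in frame.items())
    case_counts = Counter((s, c) for s, frame in examples for c in frame)
    return sense_counts, filler_counts, case_counts


def classify(frame, model, smoothing=1e-6):
    """Return argmax_s P(s) * prod_c P(n_c | s), computed in log space."""
    sense_counts, filler_counts, case_counts = model
    total = sum(sense_counts.values())
    best_sense, best_logp = None, -math.inf
    for s, count in sense_counts.items():
        logp = math.log(count / total)
        for c, k in frame.items():
            seen = case_counts[(s, c)]
            p = filler_counts[(s, c, k)] / seen if seen else 0.0
            logp += math.log(max(p, smoothing))  # floor avoids log(0)
        if logp > best_logp:
            best_sense, best_logp = s, logp
    return best_sense
```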

A number of methods involve a parametric constant: the threshold value for the association
degree (RB), a generalization level for case filler nouns (NB), and $\alpha$ in Equation 3.6 (VSM
and BGH). For these parameters, we conducted several trials prior to the actual comparative
experiment, to determine the optimal parameter values over a range of data sets. For our
method, we set $\alpha$ extremely large, which is virtually equivalent to relying solely on the
SIM of the case with the greatest CCD. However, note that when the SIM of the case with the
greatest CCD is equal for multiple verb senses, the system computes the SIM of the case with the
second highest CCD. This process is repeated until only one verb sense remains. When more than
one verb sense is selected by any given method (or none remains, in the rule-based method),
the system simply selects the verb sense that appears most frequently in the database¹⁶.
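
The behaviour obtained with a very large α can be sketched directly as a cascade over cases in
decreasing order of CCD, without computing a weighted average at all. The Python fragment below
is such a sketch; the function name and argument layout are hypothetical, and it assumes that
every candidate sense lists at least one example for each case of the input.

```python
def cascade_select(input_fillers, senses, ccd, sim, sense_frequency):
    """Compare senses case by case in decreasing CCD order; break ties by frequency.

    input_fillers:   {case: noun}
    senses:          {sense: {case: [example nouns]}}
    sense_frequency: {sense: count in the database}
    """
    remaining = list(senses)
    for case in sorted(input_fillers, key=lambda c: ccd[c], reverse=True):
        sims = {s: max(sim(input_fillers[case], e) for e in senses[s][case])
                for s in remaining}
        best = max(sims.values())
        remaining = [s for s in remaining if sims[s] == best]
        if len(remaining) == 1:
            return remaining[0]
    # Still tied after every case: fall back to the most frequent sense.
    return max(remaining, key=lambda s: sense_frequency[s])
```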



¹⁵ A number of experimental results have shown the effectiveness of the Naive-Bayes method for word
sense disambiguation [78|,

¹⁶ One may argue that this goes against the basis of the rule-based method, in that, given a proper
threshold value for the association degree, the system could improve on the accuracy (potentially
sacrificing the coverage), and that the trade-off between the coverage and the accuracy is
therefore a more appropriate evaluation criterion. However, our trials on the rule-based method
with different threshold values did not show significant correlation between the improvement of
the accuracy and the degeneration of the coverage.

In the experiment, we conducted six-fold cross validation, that is, we divided the training/test
data into six equal parts¹⁷, and conducted six trials in which a different part was used as test
data each time, and the rest as training data (the database). We evaluated the performance of each
method according to its accuracy, that is, the ratio of the number of correct outputs to the
total number of inputs. The training/test data used in the experiment contained 1,315 simple
Japanese sentences collected from news articles¹⁸. Each sentence in the training/test data
contained one or more complement(s) followed by one of the eleven verbs described in Table 3.2.
In Table 3.2, the column "English gloss" describes typical English translations of the Japanese
verbs. The column "# of sentences" denotes the number of sentences in the corpus, and "# of
senses" denotes the number of verb senses contained in IPAL [56]. The column "accuracy" shows
the accuracy of each method.
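
The evaluation protocol can be sketched in Python as below. The sketch assumes that the data is a
list of (input, correct sense) pairs and that parts are formed by taking every sixth item, which
is only one of several ways to split the data; the function names are hypothetical. The
`most_frequent` classifier is included as an example and corresponds to the lower bound (LB)
baseline.

```python
from collections import Counter


def cross_validate(data, classify_with, folds=6):
    """Return the overall accuracy under k-fold cross validation.

    data:          list of (input, correct_sense) pairs
    classify_with: function (training_data, input) -> predicted sense
    """
    correct = 0
    for k in range(folds):
        test = data[k::folds]  # every folds-th item, starting at offset k
        training = [d for i, d in enumerate(data) if i % folds != k]
        correct += sum(classify_with(training, x) == y for x, y in test)
    return correct / len(data)


def most_frequent(training, _input):
    """Lower-bound baseline: always answer the most frequent training sense."""
    return Counter(y for _, y in training).most_common(1)[0][0]
```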



Table 3.2: The verbs contained in the corpus used, and the accuracy of the different verb sense
disambiguation methods (LB: lower bound, RB: rule-based method, NB: Naive-Bayes method,
VSM: vector space model, BGH: the Bunruigoihyo thesaurus)

                               # of    # of             accuracy (%)
verb      English gloss   sentences  senses     LB     RB     NB    VSM    BGH
--------  --------------  ---------  ------  -----  -----  -----  -----  -----
ataeru    give                  136       4   66.9   62.1   75.8   84.1   86.0
kakeru    hang                  160      29   25.6   24.6   67.6   73.4   76.2
kuwaeru   add                   167       5   53.9   65.6   82.2   84.0   86.8
motomeru  require               204       4   85.3   82.4   87.0   85.5   85.5
noru      ride                  126      10   45.2   52.8   81.4   80.5   85.3
osameru   govern                108       8   30.6   45.6   66.0   72.0   74.5
tsukuru   make                  126      15   25.4   24.9   59.1   56.5   69.9
toru      take                   84      29   26.2   16.2   56.1   71.2   75.9
umu       bear offspring         90       2   83.3   94.7   95.5   92.0   99.4
wakaru    understand             60       5   48.3   40.6   71.4   62.5   70.7
yameru    stop                   54       2   59.3   89.9   92.3   96.2   96.3
--------  --------------  ---------  ------  -----  -----  -----  -----  -----
total                          1315           51.4   54.8   76.6   78.6   82.3



Looking at Table 3.2, one can see that our similarity-based method outperformed the other
methods (irrespective of the similarity computation), although the Naive-Bayes method is
relatively comparable in performance. Surprisingly, despite the relatively ad-hoc similarity
definition utilized (see Table 3.1), the Bunruigoihyo thesaurus led to a greater accuracy gain
than the vector space model. In order to estimate the upper bound (limitation) of the
disambiguation task,



'^^Ideally speaking, training and test data should be drawn from different sources, to simulate a real application. 
However, the sentences were already scrambled when provided to us, and therefore we could not identify the 
original source corresponding to each sentence. 

^^Morph/syntax analyses were manually conducted on the corpus to avoid errors potentially caused by existing 
tools 
that is, to what extent a human expert makes errors in disambiguation [42], we analyzed
incorrect outputs and found that roughly 30% of the system errors using the Bunruigoihyo
thesaurus fell into this category. It should be noted that while the vector space model requires
computational cost (time/memory) of an order proportional to the size of the vector,
determination of paths in the Bunruigoihyo thesaurus comprises a trivial cost^^.

In order to evaluate the effectiveness of the CCD factor, we also investigated the accuracy
of a similarity-based method which does not consider the CCD factor (as performed by
Kurohashi and Nagao [77]). In other words, this method computes the score for verb sense s
simply by summing the similarity degrees of the input case fillers with respect to each of the
example case fillers, as in Equation 3.9. We used the Bunruigoihyo thesaurus for the similarity
computation between two case filler nouns.

Score(s) = \sum_{c} SIM(n_c, \mathcal{E}_{s,c})    (3.9)

We found that the CCD factor led to an accuracy gain from 76.9% to 82.3% (a gain of 5.4
percentage points).
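
As an illustration only, the following Python sketch contrasts the plain sum of Equation 3.9 with a CCD-weighted score. The similarity values and CCD weights are made-up numbers, the function names are hypothetical, and the normalized weighted average is merely our assumed reading of how the CCD factor enters the score; it is not the code used in the experiments.

```python
from typing import Dict

def score_unweighted(sim: Dict[str, float]) -> float:
    """Equation 3.9: sum the case-filler similarities over all cases c."""
    return sum(sim.values())

def score_ccd_weighted(sim: Dict[str, float], ccd: Dict[str, float]) -> float:
    """Assumed CCD variant: weighted average of the same similarities."""
    total_weight = sum(ccd[c] for c in sim)
    return sum(sim[c] * ccd[c] for c in sim) / total_weight

# Hypothetical similarities SIM(n_c, E_{s,c}) for two candidate verb senses,
# keyed by case marker, and hypothetical CCD weights for each case.
sim_s1 = {"ga": 11.0, "wo": 5.0}
sim_s2 = {"ga": 6.0, "wo": 9.0}
ccd = {"ga": 0.2, "wo": 0.8}

# The plain sum prefers sense 1, the CCD-weighted score prefers sense 2,
# because the "wo" case is assumed to be the more discriminative one.
prefers_s1_unweighted = score_unweighted(sim_s1) > score_unweighted(sim_s2)
prefers_s2_weighted = score_ccd_weighted(sim_s2, ccd) > score_ccd_weighted(sim_s1, ccd)
```

With these toy numbers the two scores disagree on the selected sense, which is the kind of case in which weighting cases by their contribution can change the output.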

We also investigated how the accuracy of each method improved as the training data was 
increased, because the performance of corpus-based methods generally depends on the size of 
training data. For this purpose, we initially used only the examples taken from IPAL, and
progressively increased the size of the training data used. Figure 3.8 shows the relation
between the number of the training data used and the accuracy of different methods. In this
figure, zero on the x-axis represents the system using only the IPAL examples, which imitates
a dictionary definition-based method^'^. We derive from this figure that our method using the
Bunruigoihyo outperformed the other methods, irrespective of the size of training data, and 
that enhancing the volume of training data significantly improved on the accuracy using only 
the IPAL examples. The latter observation justifies the necessity of supervision on large-sized 



training data, as shown in past experiments pq , 104|. 

One may argue that given sufficient co-occurrence statistics, the vector space model should
outperform hand-crafted thesauri, in other words, that human lexicographers' knowledge is no
longer needed. We investigate this prediction in Table 3.3, which shows the relation between the
frequency of nouns appearing in the co-occurrence data extracted from the RWC text base
(see Section 3.3) and the accuracy of verb sense disambiguation, in which the "frequency" entry
denotes the threshold of the frequency of nouns. The "coverage" entry denotes the ratio between
the number of inputs including at least one noun with frequency over a given threshold, and the

^^We will propose a method to optimize the computational cost for the vector space model in Chapter 
^"See the item "Machine readable dictionaries" (p. 25) in Section 2.2.2. 
Figure 3.8: The relation between the number of training data used and accuracy of the different
methods 
total number of inputs. The last two entries show the accuracy with different similarity measures,
for each coverage. Surprisingly, not only the accuracy of VSM but also the accuracy of the
Bunruigoihyo thesaurus increased as the threshold of the frequency increased, and VSM did not
outperform the Bunruigoihyo thesaurus for any of the thresholds. We could assume frequently
appearing nouns are used so commonly that even human lexicographers can reasonably define
the similarity between them in a thesaurus. In addition, nouns which frequently appear in the
co-occurrence data also appear in the training data, and therefore they provide the maximal
similarity (that is, "exact matching") independent of which similarity measure is used. We
would like to note that human knowledge is useful in the task of word sense disambiguation, as
with other NLP research [69].



Table 3.3: The frequency of nouns and resultant accuracy of verb sense disambiguation 



frequency   >100     >1000    >10000
coverage    72.4%    57.2%    16.3%
VSM         83.5     87.9     91.4
BGH         88.4     92.0     94.4



We also investigated errors made by the rule-based method to find a rational explanation 
for its inferiority. We identified that the association measure in Equation tends to give a 
greater value to less frequently appearing verb senses and lower level (more specified) classes, 
and therefore chosen rules are generally overspecified^^ . Consequently, frequently appearing verb 
senses are likely to be rejected. On the other hand, when attempting to enhance the rule set by 
setting a smaller threshold value for the association score, overgeneralization can be a problem. 
We also note that one of the theoretical differences between the rule-based and similarity-based 
methods is that the former statically generalizes examples (prior to system usage), while the 
latter does so dynamically. Static generalization would appear to be relatively risky for sparse 
training data. 

Although comparison of different approaches to word sense disambiguation should be further 
investigated, this experimental result gives us good motivation to explore similarity-based verb 
sense disambiguation approaches in the following sections and chapters. 



This problem has also been identified by Charniak 
3.5 Further Enhancement

3.5.1 Computation of interpretation certainty 



Since, as shown in Table 3.2, the system still finds it difficult to achieve 100% accuracy, it is
important to select presumably correct outputs from the overall outputs (potentially sacrificing 
system coverage), for practical applications. Integrated with other NLP systems, it is desirable 
that other systems can vary the degree of reliance on the output of the verb sense disambiguation 
system. To achieve this, it is useful to estimate the degree of certainty as to the interpretation, 
so that one can gain higher accuracy selecting only outputs with greater certainty degree. 

A number of methods have been proposed to compute the interpretation certainty in word
sense disambiguation [157] and text categorization [30]. These methods estimate interpretation
certainty as the ratio between the probability of the most plausible category (word
sense/text category), and the probability of any other category, excluding the most probable one.
Similarly, in the verb sense disambiguation system, we introduce the notion of interpretation
certainty based on the following preference conditions [36]:

1. the highest interpretation score is greater, 

2. the difference between the highest and second highest interpretation scores is greater. 



The rationale for these conditions is given below. Consider Figure 3.9, where each symbol 
denotes an example in a given corpus, with symbols x as unsupervised examples and symbols e 
as supervised examples. The curved lines delimit the semantic vicinities (extents) of the two 
verb senses 1 and 2, respectively^'^. The semantic similarity between two examples is graphically 
portrayed by the physical distance between the two symbols representing them. In Figure 3.9- 
a, x's located inside a semantic vicinity are expected to be interpreted as being similar to 
the appropriate example e with high certainty, a fact which is in line with condition 1 above. 
However, in Figure 3.9-b, the degree of certainty for the interpretation of any x which is located 
inside the intersection of the two semantic vicinities cannot be great. This occurs when the case 
fillers associated with two or more verb senses are not selective enough to allow for a clear cut 
delineation between them. This situation is explicitly rejected by condition 2. 

Based on the above two conditions, we compute interpretation certainties using Equation 3.10,
where C(x) is the interpretation certainty of an example x. Score_1(x) and Score_2(x)
are the highest and second highest scores for x, respectively. \lambda, which ranges from 0 to 1,
is a parametric constant used to control the degree to which each condition affects the
computation of C(x).

C(x) = \lambda \cdot Score_1(x) + (1 - \lambda) \cdot (Score_1(x) - Score_2(x))    (3.10)

Figure 3.9: The concept of interpretation certainty
Figure 3.9-a: The case where the interpretation certainty of the enclosed x's is great
Figure 3.9-b: The case where the interpretation certainty of the x's contained in the intersection
of senses 1 and 2 is small

Note that this method can easily be extended for a verb which has more than two senses.



We estimate the validity of the notion of the interpretation certainty, by the trade-off between 
the accuracy and coverage of the system. Note that in this experiment, the accuracy is the 
ratio between the number of correct outputs, and the number of cases where the interpretation 
certainty of the output is above a certain threshold. The coverage is the ratio between the 
number of cases where the interpretation certainty of the output is above a certain threshold, 
and the number of inputs. By raising the value of the threshold, the accuracy also increases (at 
least theoretically), while the coverage decreases. 
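
The certainty computation and the accuracy/coverage trade-off described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the second highest score is taken to be zero when only one candidate sense exists, and the listed outputs are toy values rather than experimental data.

```python
from typing import List, Tuple

def certainty(scores: List[float], lam: float = 0.5) -> float:
    """Equation 3.10 applied to the scores of all candidate senses of x."""
    ranked = sorted(scores, reverse=True)
    best = ranked[0]
    # Assumption: with a single candidate sense, the runner-up score is 0.
    second = ranked[1] if len(ranked) > 1 else 0.0
    return lam * best + (1.0 - lam) * (best - second)

def accuracy_and_coverage(
    outputs: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """outputs holds (certainty, is_correct) pairs, one per input.

    Only outputs whose certainty is above the threshold are kept.
    Accuracy is measured on the kept outputs, coverage on all inputs.
    """
    kept = [is_correct for c, is_correct in outputs if c > threshold]
    coverage = len(kept) / len(outputs)
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Toy outputs: raising the threshold trades coverage for accuracy.
toy_outputs = [(8.0, True), (5.5, False), (7.0, True), (3.0, False)]
low_threshold = accuracy_and_coverage(toy_outputs, 0.0)
high_threshold = accuracy_and_coverage(toy_outputs, 6.0)
```

A clear winner such as scores (10, 4) yields a higher certainty than a near tie such as (10, 9), even though the highest score is identical, which reflects condition 2.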

The system used the Bunruigoihyo thesaurus for the similarity computation, and was evaluated
by way of six-fold cross validation using the same corpus as that used for the experiment
described in Section 3.4. Figure 3.10 shows the result of the experiment with several values of
\lambda, from which the optimal \lambda value seems to be in the range around 0.5. It can be seen
that, as we assumed, both of the above conditions are essential for the estimation of the
interpretation certainty.

Figure 3.10: The relation between coverage and accuracy with different \lambda's

3.5.2 Incorporation of contextual constraints

A number of researchers have pointed out that words tend to maintain the same sense within
a given context [101, 157]. In other words, when the same verb appears multiple times in the same
context, we can generally assume that it will take the same sense. The crucial issue then
becomes which verb sense to select if each verb occurrence is interpreted with a different sense
by a dedicated method (for example, the methods compared in Section 3.4). Here, we newly
introduce a method to propagate contextual constraint through the degree of interpretation
certainty. Let us assume that a given input includes multiple distinct verbs of common lexical
content, and that one of them is interpreted with (unreasonably) smaller certainty than the
others. In the following, we list cases where the similarity cannot be reasonably computed:

• insufficient supervised training data, 

• input case markers are topicalized or substituted for, 

• input cases are omitted, 

• input case filler nouns are unlisted in the thesaurus^'^ (or unlisted in the co-occurrence 
data in the case of the vector space model), 

• input case fillers comprise compound nouns. 

In such cases, we superimpose interpretations with higher certainty onto those with lower
certainty. It should be noted that while the four problems above can be individually solved through
different approaches (some of which are discussed in Section 3.5.3), our approach can be used
as a complementary solution in a general formalism. To exemplify our contextual propagation
method, let us take the following example (taken from the EDR Japanese corpus [^]):

(7) satou wo tsuka-wazuni . . . parachinousu wo tsukau. 

(sugar- ACC) (do not "use MATERIAL") ... (palatinose-ACC) (?) 

This example contains two occurrences of the polysemous Japanese verb tsukau, which has
multiple senses in the EDR dictionary |5|], such as "to employ", "to operate", "to spend" and
"to use MATERIAL". While the sense of the former tsukau can be correctly identified as "to use
MATERIAL", the disambiguation of the latter tsukau fails because the Bunruigoihyo thesaurus
does not list the technical term parachinousu ("palatinose"), which is an artificial sweetener.
However, we can identify the correct interpretation for the latter tsukau by superimposing the
interpretation for the former one, i.e. "use MATERIAL". Computation of the degree of certainty
is performed using the method proposed in Equation 3.10 (see Section 3.5.1).
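
The superimposition step can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, the certainty values are invented for the tsukau example, and every occurrence of the verb in the context simply receives the sense of the most certain occurrence.

```python
from typing import List, Tuple

def propagate(interpretations: List[Tuple[str, float]]) -> List[str]:
    """Superimpose the most certain interpretation onto all occurrences.

    interpretations holds one (sense, certainty) pair per occurrence of the
    same verb within one context (here, one sentence).
    """
    best_sense, _ = max(interpretations, key=lambda pair: pair[1])
    return [best_sense for _ in interpretations]

# Example (7): the first tsukau is interpreted with high certainty, the
# second with low certainty because "parachinousu" is not in the thesaurus.
# Both certainty values are hypothetical.
occurrences = [("to use MATERIAL", 8.0), ("to spend", 1.5)]
propagated = propagate(occurrences)
```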



For the evaluation of this method, we used the EDR Japanese corpus [E^], which was originally
extracted from news articles. Note that the corpus used for experiments in Section 3.4 was
not applicable to this evaluation because this corpus was compiled on a simple sentence basis.
The EDR corpus provides sense information for each word, based on the EDR dictionary, and we
used this as a means of checking the interpretation. From the EDR corpus, we first collected
sentences containing one of ten frequently appearing verbs, producing a total of 10,880
sentences. Thereafter, we collected sentences in which a polysemous verb appears more than once,
from the aforementioned collection. We limited the range of context to one sentence because
the EDR corpus does not provide wider contextual information, such as paragraph boundaries
and sentence genres. The number of derived sentences was 462 (out of 10,880), which means
the applicability of this method is relatively small. However, the use of contextual information
enhanced the accuracy of the similarity-based method using the Bunruigoihyo thesaurus from
60.4% to 64.1%.

Most thesauri, including WordNet and the Bunruigoihyo thesaurus, lack proper nouns and technical terms.



3.5.3 Remaining problems 

Let us discuss how further enhancements to our verb sense disambiguation system could be 
made. 

First, it should be noted that in Japanese, case markers can be omitted or topicalized (i.e.
marked with postposition wa), an issue which our framework does not currently consider. In
addition, polysemous verbs in relative clauses, where the surface case marker of the head noun
is omitted, pose a similar problem. Kurohashi and Nagao [77] proposed a way of modeling such
(irregular) verb complements, by matching them to complements followed by ga, ni or wo based
on the similarity between respective case fillers. Baldwin et al. |^] and Baldwin proposed a
head gapping method for Japanese relative clauses, which identifies appropriate case slots for
head nouns in relative clauses. Anaphora/ellipsis resolution is expected to overcome the case
omission problem. Provided that this processing can be carried out successfully, the similarity
between an input and examples is expected to be more reliable.

Second, our system is currently limited to the vocabulary defined in the Bunruigoihyo
thesaurus. Thesaurus extending methods [138, 143] are expected to counter this problem.
Vocabulary size is also problematic when case filler nouns comprise compound nouns,
and therefore extraction of the semantic head is a crucial task. Note that in Japanese, compound
nouns lack lexical segmentation. It has been empirically shown that in more than 90% of cases
of four kanji character Japanese compound nouns^^, the semantic head consists of the last two
characters [7C]. However, analysis of longer compound nouns still remains a challenging task.
Kobayashi et al. [70, 71], for example, proposed a method of analyzing the syntactic/semantic
structure of Japanese compound nouns, which can further enhance our system coverage.

^*It should be noted that according to our preliminary observation, the EDR corpus contains a number of sense
annotation errors and "nil" verb senses (unanalysed/unanalyseable verb senses). In view of this problem, we did
not use this corpus for other experiments in this research.

^^Kanji characters are basic ideograms in Japanese.

Third, some idiomatic expressions represent highly restricted collocations, and overgeneralizing
them semantically through the use of a thesaurus can cause further errors. Possible solutions
would include one proposed by Uramoto, in which idiomatic expressions are described separately
in the database so that the system can control their overgeneralization [142] (as discussed in



Section |23). 

Finally, external information such as the discourse or domain dependency of each word 
sense [47, 101, 157] is expected to lead to system improvement^^. 



3.6 Summary 

This chapter described our similarity-based verb sense disambiguation system [38]. The basis
of the system is as follows. First, provided with sentences containing a polysemous verb, the 
system searches the database for the most similar example to the input, following the nearest 
neighbor method. The database consists of supervised examples which are manually annotated 
with correct verb senses. Thereafter, the verb is disambiguated by superimposing the sense 
of the verb appearing in the most similar supervised example. The similarity (score) between 
the input and an example, or more precisely, the weighted average of the similarity between 
case filler nouns included in them, is computed based on either a statistical measure or an 
existing thesaurus. For the statistical similarity measure, we use the ubiquitous vector space 



model p3| , [Tq , 106, 127, 130|. In this, each case filler noun is represented as a vector comprising 
statistical factors about its collocation, which are taken from a large-scaled text base, and the 
similarity between two nouns is computed as the cosine of the angle between their associated
vectors. As for the thesaurus-driven similarity measure [77, 142], we applied a method
proposed by Kurohashi and Nagao [77], which determines the similarity between two nouns
based on the length of the path between them in the hand-crafted Bunruigoihyo thesaurus [102]. In
practice, morphological and syntactic analyses are needed prior to the verb sense disambiguation
process. We currently assume the use of existing tools for morphological and syntactic analysis
on the input, and do not focus on these modules in this research.
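
For concreteness, the cosine measure used in the vector space model can be written as the short sketch below. The function name is hypothetical, the vectors are toy values, and returning zero for an all-zero vector is our own convention for the illustration.

```python
import math
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two noun vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumed convention: a noun without co-occurrence data is similar to nothing.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collocation vectors for three nouns.
noun_a = [2.0, 0.0, 1.0]
noun_b = [4.0, 0.0, 2.0]   # same direction as noun_a
noun_c = [0.0, 3.0, 0.0]   # orthogonal to noun_a
```

The cost of one comparison grows with the vector length, which is the computational cost attributed to the vector space model in Section 3.4.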

Let us summarize the main points that have been made in this chapter. 

First, we introduced the weighting factor of case contribution to disambiguation (CCD),
which computes the degree to which each case filler contributes to verb sense disambiguation.
Intuitively speaking, greater diversity of semantic range of case filler examples will lead to
greater contribution of that case to verb sense disambiguation. We then compute the similarity
between the input and each supervised example, placing emphasis on cases with greater CCD
values. We also identified that the introduction of the CCD factor can be seen as a feature
selection method, that is, cases with greater CCD values constitute a useful feature set. In order
to investigate the effectiveness of CCD, we compared our system with the rule-based and
Naive-Bayes methods. Empirical results showed that the similarity-based system combined with the
CCD factor improved on the lower bound performance to a larger degree than the rule-based and
Naive-Bayes methods. We also showed that the use of the Bunruigoihyo thesaurus is comparable
with the vector space model in the similarity computation. While criticism has been made of
inherent limitations in hand-crafted resources, our experimental results support the effectiveness
of these resources for NLP applications. In addition, we showed that additional supervised
training examples significantly improved on the performance obtained by relying solely on a
small number of examples taken from the machine readable dictionary IPAL [56].

Note that we limited the range of context to one sentence in Section 3.5.2, because the EDR corpus does not
provide wider contextual information, such as paragraph boundaries and sentence genres.

Second, in order to achieve higher accuracy, we selected presumably correct outputs from
the overall outputs by use of the notion of interpretation certainty. The interpretation certainty
is greater when (a) the score associated with the selected verb sense is greater, and (b) the
difference between the highest and second highest scores is greater. We showed the effectiveness
of this computation by way of experiments.

Finally, our prototype method of propagating contextual constraints further improved on
the performance of the similarity-based method.
Chapter 4

Selective Sampling of Effective 
Example Sets 

This chapter proposes an efficient example sampling method for similarity-based word
sense disambiguation systems. To construct a database of a practical size, a considerable
overhead for manual sense disambiguation ("overhead for supervision") is
required. In addition, the time complexity of searching a large-sized database poses a
considerable problem ("overhead for search"). To counter these problems, our method
selectively samples a smaller-sized effective subset from a given example set for use
in word sense disambiguation. Our method is characterized by its reliance on the notion
of 'training utility': the degree to which each example is informative for future
example sampling when used for the training of the system. The system progressively
collects examples by selecting those with greatest utility. This paper reports on
the effectiveness of our method through experiments on about one thousand sentences.
Compared to experiments with other example sampling methods, our method reduced
both the overhead for supervision and the overhead for search, without degrading
the performance of the system.

4.1 Motivation 

In Chapter 3, we described the similarity-based verb sense disambiguation system, and showed
its effectiveness through comparative experiments. Following the nearest neighbor method, our
system uses the database, which contains example sentences associated with each verb sense.
Given an input sentence containing a polysemous verb, the system chooses the most plausible
verb sense from predefined candidates. In this process, the system computes a scored similarity
between the input and examples in the database, and chooses the verb sense associated with
the example which maximizes the score. To realize this, we have to manually disambiguate
polysemous verbs appearing in examples, prior to their use by the system. We shall call these
examples "supervised examples". A preliminary experiment conducted in Section 3.4 on eleven
polysemous Japanese verbs showed that (a) the more supervised examples we provided to the
system, the better it performed, and (b) in order to achieve a reasonable result (say over 80%
accuracy), the system needed on the order of a hundred supervised examples for each verb.
Therefore, in order to build an operational system, the following problems have to be taken
into account^:

• given human resource limitations, it is not reasonable to supervise every example in large 
corpora ("overhead for supervision"), 

• given the fact that similarity-based systems, including our system, search the database for 
the examples most similar to the input, the computational cost becomes prohibitive if one 
works with a very large database size ("overhead for search"). 

Empirically speaking, we observed that about 84% of the 900 or so verbs defined in IPAL [56]
are polysemous, and that some of these verbs are associated with debilitatingly large numbers
of senses (the maximum number of senses is 32). These statistics show that the overhead for
supervision in a real application is never trivial. It should be noted that the overhead for
supervision is also crucial when one tries to customize a WSD system to several distinct domains,
because (a) word sense distinctions often differ between domains, and therefore (b)
the overhead for supervision is incurred separately in each domain.

One may argue that unsupervised WSD methods (for example, those described in
Section 2.2.2) can overcome the former problem. To investigate this assumption, we preliminarily
applied the bootstrapping method to the corpus used in Section 3.4. In this experiment, we
first provided the similarity-based system (using the Bunruigoihyo thesaurus) with an initial
database, consisting of examples taken from IPAL [56] (see the "example nouns" entry listed in
Figure |3!^ ).

Thereafter, the system repeats the training process and automatically incorporates the example
with maximal interpretation certainty into the database, until no training data remains. We
used Equation 3.10 for the computation of interpretation certainty (see Section 3.5.1). Finally,
we evaluated the accuracy of verb sense disambiguation on open test data. For this purpose, we
divided the corpus into test and training data and conducted six-fold cross validation, as carried
out in Section 3.4. We found that the accuracy was 66.1%, which is less than satisfactory when
compared with the accuracy of the supervised methods listed in Table 3.2. In fact, the
bootstrapping method did not significantly improve on the accuracy rate achieved only using the
initial database (64.7%). This experimental result suggests that we still need supervised
methods in certain cases, although we concede that unsupervised methods also need to be explored
to counter the overhead for supervision.

^Note that these problems are associated with corpus-based approaches in general, and have been identified
by a number of researchers H, 1^, plll,[l57t.

Motivated by the above arguments, we propose a different approach, namely to select a
small number of optimally informative examples from given corpora |3^. Hereafter we will
call these examples "samples". Our example sampling method, based on the utility maximization
principle, decides on the preference for including a given example in the database. This
decision procedure is usually called "selective sampling" [p3]. The overall control flow of
selective sampling systems can be depicted as in Figure 4.1, where "system" refers to our verb sense
disambiguation system, and "examples" refers to an unsupervised example set^. The sampling
process basically cycles between the word sense disambiguation (WSD) and training phases.
During the WSD phase, the system generates an interpretation for each polysemous verb
contained in the input example ("WSD outputs"). This phase is equivalent to normal word sense
disambiguation execution. During the training phase, the system selects samples for training
from the previously produced outputs. During this phase, a human expert supervises samples,
that is, provides the correct interpretation for the verbs appearing in the samples. Thereafter,
samples are simply incorporated into the database without any computational overhead (as
would be associated with globally re-estimating parameters in statistics-based systems), meaning
that the system can be trained on the remaining examples ("residue") for the next iteration.
Iterating between these two phases, the system progressively enhances the database. It should
be noted that the selective sampling procedure gives us an optimally informative database of a
given size irrespective of the stage at which processing is terminated.
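
The cycle between the WSD and training phases can be outlined as in the sketch below. It is an outline under stated assumptions, not the system itself: the names `selective_sampling`, `utility` and `supervise` are hypothetical, the WSD phase is hidden inside the `utility` callable, and `supervise` stands in for the human expert.

```python
from typing import Callable, List, Tuple

Example = str
Database = List[Tuple[Example, str]]  # (example, verb sense) pairs

def selective_sampling(
    residue: List[Example],
    database: Database,
    utility: Callable[[Example, List[Example], Database], float],
    supervise: Callable[[Example], str],
    budget: int,
) -> Database:
    """Grow the database by repeatedly sampling the most useful example."""
    residue = list(residue)      # work on copies, leave the inputs intact
    database = list(database)
    for _ in range(budget):
        if not residue:
            break
        # WSD phase plus sample selection: score every remaining example.
        sample = max(residue, key=lambda x: utility(x, residue, database))
        # Training phase: the expert supplies the sense, then the sample is
        # appended to the database with no re-estimation step.
        residue.remove(sample)
        database.append((sample, supervise(sample)))
    return database
```

Because each iteration only appends to the database, stopping after any number of iterations leaves a usable database of that size, in line with the remark above.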

Several researchers have proposed this type of approach for NLP applications. Engelson
and Dagan [^] proposed a committee-based sampling method, which is currently applied to
HMM training for part-of-speech tagging. This method sets several models (committee) taken
from a given supervised data set, and selects samples based on the degree of disagreement
among the committee members as to the output. However, as this method is implemented
for statistics-based models, there is a need to explore how to formalize and map the concept of
selective sampling into similarity-based approaches. Lewis and Gale [80] proposed an uncertainty
sampling method for statistics-based text classification. In this method, the system always
samples outputs with an uncertain level of correctness. However, we should take into account
the training effect a given example has on other unsupervised examples, introduced as 'training
utility' in our method. We devote Section 4.3 to further comparison of our approach and other
related works.

Figure 4.1: Flow of control of the selective sampling

^The "system" in Figure includes the "VSD", "thesaurus/co-occurrence" and "morph/syntax analyzer"
from Figure However, note that the morphological and syntactic analyses for "examples" are needed only at
the first stage of the iteration.

With respect to the problem of overhead for search, possible solutions would include the
generalization of similar examples through a thesaurus taxonomy [107] or the reconstruction
of the database using a small portion of useful instances selected from a given supervised example
set ISE]. Aha et al. [|[ select examples that significantly contribute to the performance gain
on test data. However, such approaches imply a significant overhead for supervision of each
example prior to the system's execution. This shortcoming is precisely what our approach aims
to avoid: reduction of the overhead for supervision as well as the overhead for search.



In the following, Section 4.2 first elaborates on the methodology of our selective sampling
method. Section 4.3 then evaluates our method by way of experiments. Section 4.4 describes
related work. Finally, discussion is added in Section 4.5.



4.2 Methodology 



4.2.1 Overview 



Let us look again at Figure 4.1, in which "WSD outputs" refers to a corpus in which each sentence
is assigned an expected verb interpretation during the WSD phase. In the training phase, the
system stores supervised samples (with each interpretation simply checked or appropriately
corrected by a human) in the database, to be used in a later WSD phase. In this section, we
turn to the problem as to which examples should be selected as samples.

Lewis and Gale [80] proposed the notion of uncertainty sampling for the training of
statistics-based text classifiers. Their method selects those examples that the system classifies with
minimum certainty, based on the assumption that there is no need for teaching the system the
correct answer when it has answered with sufficiently high certainty. However, we should take
into account the training effect a given example has on other remaining (unsupervised) examples.
In other words, we would like to select samples so as to be able to correctly disambiguate
as many examples as possible in the next iteration. If this is successfully done, the number
of examples to be supervised will decrease. We consider maximization of this effect by means
of a training utility function aimed at ensuring that the example with the greatest training
utility factor is the most useful example at a given point in time. Intuitively speaking, the
training utility of an example is greater when we can expect greater increase in the interpretation
certainty of the remaining examples after training using that example (see Section 3.5.1 for the
computation of interpretation certainty).



To explain this notion intuitively, let us take Figure 4.2 as an example corpus. In this corpus,
all sentences contain the verb yameru, which has two senses according to IPAL, $s_1$ ("to stop
(something)") and $s_2$ ("to quit (occupation)"). In this figure, sentences $e_1$ and $e_2$ are
supervised examples associated with the senses $s_1$ and $s_2$, respectively, and the $x_i$'s are
unsupervised examples. For the sake of enhanced readability, the examples $x_i$ are partitioned
according to their verb senses, that is, $x_1$ to $x_5$ correspond to sense $s_1$, and $x_6$ to
$x_9$ correspond to sense $s_2$. In addition, one may notice that examples in the corpus can be
readily categorized based on case similarity, that is, into clusters $\{x_1, x_2, x_3, x_4\}$
("someone/something stops service"), $\{e_2, x_6, x_7\}$ ("someone leaves organization"),
$\{x_8, x_9\}$ ("someone quits occupation"), $\{e_1\}$ and $\{x_5\}$. Let us simulate the
sampling procedure with this example corpus. In the initial stage with $\{e_1, e_2\}$ in the
database, $x_6$ and $x_7$ can be interpreted as $s_2$ with greater certainty than the other
$x_i$'s, because these two examples are similar to $e_2$. Therefore, uncertainty sampling selects
any example except $x_6$ and $x_7$ as the sample. However, it is evident that any of $x_1$ to
$x_4$ is more desirable because, by incorporating one of these examples, we can interpret more
$x_i$'s with greater certainty. Assuming that $x_1$ is selected as the sample and incorporated
into the database with sense $s_1$, either of $x_8$ and $x_9$ will be more highly desirable than
other unsupervised $x_i$'s in the next stage.

Let $S$ be a set of sentences, i.e. a given corpus, and $D$ be the subset of supervised
examples stored in the database. Further, let $X$ be the set of unsupervised examples, so that
Equation 4.1 holds.

\[ S = D \cup X \tag{4.1} \]

        nominative                    accusative                     verb      sense
e1      seito ga (student-NOM)        shitsumon wo (question-ACC)    yameru    (s1)
e2      ani ga (brother-NOM)          kaisha wo (company-ACC)        yameru    (s2)
x1      shain ga (employee-NOM)       eigyou wo (sales-ACC)          yameru    (?)
x2      shouten ga (store-NOM)        eigyou wo (sales-ACC)          yameru    (?)
x3      koujou ga (factory-NOM)       sougyou wo (operation-ACC)     yameru    (?)
x4      shisetsu ga (facility-NOM)    unten wo (operation-ACC)       yameru    (?)
x5      sensyu ga (athlete-NOM)       renshuu wo (practice-ACC)      yameru    (?)
x6      musuko ga (son-NOM)           kaisha wo (company-ACC)        yameru    (?)
x7      kangofu ga (nurse-NOM)        byouin wo (hospital-ACC)       yameru    (?)
x8      hikoku ga (defendant-NOM)     giin wo (congressman-ACC)      yameru    (?)
x9      chichi ga (father-NOM)        kyoushi wo (teacher-ACC)       yameru    (?)

Figure 4.2: Example of a given corpus associated with the verb yameru

The example sampling procedure can be illustrated as:

1. $WSD(D, X)$

2. $e \leftarrow \arg\max_{x \in X} TU(x)$

3. $D \leftarrow D \cup \{e\}$, $X \leftarrow X \cap \overline{\{e\}}$

4. goto 1

where $WSD(D, X)$ is the verb sense disambiguation process on input $X$ using $D$ as the database.
In this disambiguation process, the system outputs the following for each input: (a) a set of verb
sense candidates with interpretation scores, and (b) an interpretation certainty. These factors
are used for the computation of $TU(x)$, newly introduced in our method. $TU(x)$ computes the
training utility factor for an example $x$. The sampling algorithm gives preference to examples
of maximum utility.

We will explain in the following sections how one can estimate $TU(x)$, based on the estimation
of the interpretation certainty.
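
The sampling procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of our system: the names `selective_sampling`, `training_utility` and `supervise` are hypothetical, examples are represented as plain strings, the WSD step is assumed to happen inside the `training_utility` callback, and a `budget` parameter is added so that the loop terminates.

```python
from typing import Callable, List, Set


def selective_sampling(
    database: Set[str],
    unsupervised: Set[str],
    training_utility: Callable[[str, Set[str], Set[str]], float],
    supervise: Callable[[str], None],
    budget: int,
) -> List[str]:
    """Repeat steps 1-4 until `budget` samples are drawn or X is empty."""
    d, x = set(database), set(unsupervised)  # work on copies of D and X
    sampled: List[str] = []
    while x and len(sampled) < budget:
        # Steps 1-2: score every unsupervised example against the current D
        # and pick the one with maximum utility (ties broken by name).
        e = max(sorted(x), key=lambda cand: training_utility(cand, d, x))
        supervise(e)   # a human checks or corrects the sense of e
        d.add(e)       # step 3: D <- D U {e}
        x.remove(e)    #         X <- X without e
        sampled.append(e)
    return sampled     # step 4 ("goto 1") is the loop itself
```

Passing the utility function as a parameter keeps the loop independent of how $TU(x)$ is estimated, so the same skeleton could also host other selection criteria.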



4.2.2 Computation of training utility 

The training utility of an example $a$ is greater than that of another example $b$ when the total
interpretation certainty of unsupervised examples increases more after training with example
$a$ than with example $b$. Let us consider Figure 4.3, in which the x-axis mono-dimensionally
denotes the semantic similarity between two unsupervised examples, and the y-axis denotes the
interpretation certainty of each example. Let us compare the training utility of the examples
$a$ and $b$ in Figure 4.3-a. Note that in this figure, whichever example we use for training, the
interpretation certainty for each unsupervised example ($x$) neighboring the chosen example
increases based on its similarity to the supervised example. Since the increase in the
interpretation certainty of a given $x$ becomes smaller as the similarity to $a$ or $b$
diminishes, the training utility of the two examples can be represented by the shaded areas. It
is obvious that the training utility of $a$ is greater as it has more neighbors than $b$. On the
other hand, in Figure 4.3-b, $b$ has more neighbors than $a$. However, since $b$ is semantically
similar to $e$, which is already contained in the database, the total increase in interpretation
certainty of its neighbors, i.e. the training utility of $b$, is smaller than that of $a$.



Figure 4.3-a: The case where the training utility of a is greater than that of b because a has
more unsupervised neighbors

Figure 4.3-b: The case where the training utility of a is greater than that of b because b
closely neighbors e, contained in the database

Figure 4.3: The concept of training utility



Let $\Delta C(x = s, y)$ be the difference in the interpretation certainty of $y \in X$ after
training with $x \in X$, taken with the sense $s$. We compute the interpretation certainty of an
example using Equation 3.10 in Section 3.5.1. $TU(x = s)$, which is the training utility function
for $x$ taken with sense $s$, can be computed by way of Equation 4.2.

\[ TU(x = s) = \sum_{y \in X} \Delta C(x = s, y) \tag{4.2} \]



It should be noted that in Equation 4.2, we can replace $X$ with a subset of $X$ which consists of
neighbors of $x$. However, in order to facilitate this, an efficient algorithm to search for
neighbors of an example is required. We will discuss this problem in Section 4.5.



Since there is no guarantee that $x$ will be supervised with any given sense $s$, it can be risky
to rely solely on $TU(x = s)$ for the computation of $TU(x)$. We estimate $TU(x)$ by its expected
value, calculating the average of each $TU(x = s)$, weighted by the probability that $x$ takes
sense $s$. This can be realized by Equation 4.3, where $P(s|x)$ is the probability that $x$ takes
the sense $s$.

\[ TU(x) = \sum_{s} P(s|x) \cdot TU(x = s) \tag{4.3} \]

Given the fact that (a) $P(s|x)$ is difficult to estimate in the current formulation, and (b) the
cost of computation for each $TU(x = s)$ is not trivial, we temporarily approximate $TU(x)$ as in
Equation 4.4, where $K$ is a set of the $k$-best verb sense(s) of $x$ with respect to the
interpretation score in the current state.

\[ TU(x) \simeq \frac{1}{k} \sum_{s \in K} TU(x = s) \tag{4.4} \]
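
The three estimates of training utility can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the `certainty` argument stands in for Equation 3.10 (which is not reproduced here), `toy_certainty` is a made-up one-dimensional substitute, and Equation 4.4 is taken as a uniform average over the $k$-best senses.

```python
def tu_given_sense(x, s, unsupervised, db, certainty):
    """Equation 4.2: total certainty gain over X after adding (x, s) to D."""
    trained = db | {(x, s)}  # database after supervising x with sense s
    return sum(certainty(y, trained) - certainty(y, db) for y in unsupervised)


def tu_expected(x, sense_probs, unsupervised, db, certainty):
    """Equation 4.3: expectation of TU(x = s) under P(s|x)."""
    return sum(p * tu_given_sense(x, s, unsupervised, db, certainty)
               for s, p in sense_probs.items())


def tu_k_best(x, k_best, unsupervised, db, certainty):
    """Equation 4.4, read as a uniform average over the k-best senses K."""
    total = sum(tu_given_sense(x, s, unsupervised, db, certainty) for s in k_best)
    return total / len(k_best)


def toy_certainty(y, db):
    """Hypothetical stand-in for Equation 3.10: certainty decays with the
    distance between y and the nearest stored example (senses are ignored)."""
    return max(1.0 / (1.0 + abs(y - e)) for e, _ in db)
```

With `db = frozenset({(0.0, "s1")})` and unsupervised examples at 1.0, 2.0 and 3.0, supervising the example at 2.0 raises the toy certainty of 2.0 and 3.0 but not of 1.0, which is already as close to the stored example as it is to the new one.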
4.2.3 Enhancement of computation 

In this section, we discuss how to enhance the computation associated with our example sampling
algorithm.

First, we note that computation of $TU(x = s)$ in Equation 4.2 (see Section 4.2.2) becomes
time consuming because the system is required to search the whole set of unsupervised examples
for examples whose interpretation certainty will increase after $x$ is used for training. To avoid
this problem, we could potentially apply a method used in efficient database search techniques,
by which the system can search for neighbor examples of $x$ with optimal time complexity [145].
However, in this section, we will explain another efficient algorithm to identify neighbors of
$x$, in which neighbors of case fillers are taken as being given directly by the thesaurus
structure¹. The basic idea is the following: the system searches for neighbors of each case filler
of $x$ instead of $x$ as a whole, and merges them as a set of neighbors of $x$. Note that by
dividing examples along the lines of each case filler, we can retrieve neighbors based on the
structure of the Bunruigoihyo thesaurus (instead of the conceptual semantic space as in
Figure 4.3). Let $N_{x=s,c}$ be a subset of unsupervised neighbors of $x$ whose interpretation
certainty will increase after $x$ is used for training, considering only case $c$ of sense $s$.
The actual neighbor set of $x$ with sense $s$ ($N_{x=s}$) is then defined as in Equation 4.5.

\[ N_{x=s} = \bigcup_{c} N_{x=s,c} \tag{4.5} \]

Figure 4.4 shows a fragment of the thesaurus, in which $x$ and the $y$'s are unsupervised case
filler examples. Symbols $e_1$ and $e_2$ are case filler examples stored in the database taken as
senses $s_1$ and $s_2$, respectively. The triangles represent subtrees of the structure, and the
labels $n_i$ represent nodes. In this figure, it can easily be seen that the interpretation score
of $s_1$ never changes for examples other than the children of $n_4$, after $x$ is used for
training with sense $s_1$. In addition, incorporating $x$ into the database with sense $s_1$ never
changes the score of examples $y$ for other sense candidates. Therefore, $N_{x=s_1,c}$ includes
only examples dominated by $n_4$, in other words, examples which are more closely located to $x$
than $e_1$ in the thesaurus structure. Since, during the WSD phase, the system determines $e_1$ as
the supervised neighbor of $x$ for sense $s_1$, identifying $N_{x=s_1,c}$ does not require any
extra computational overhead. One may notice that the technique presented here is not applicable
when the vector space model (see Section 3.3) is used for the similarity computation. However,
automatic clustering algorithms, which give a word hierarchy based on the similarity between
words (for example, the one proposed by Tokunaga et al. [139]), could potentially facilitate the
application of this retrieval method to the vector space model.

¹ Utsuro's method requires the construction of large-scale similarity templates prior to
similarity computation [145], and this is what we would like to avoid.

Figure 4.4: A fragment of the thesaurus including neighbors of x associated with case c
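
The neighbor retrieval described above can be sketched in Python as follows. The sketch assumes, hypothetically, that each case filler is identified by a thesaurus code string whose leading digits spell the path from the root, so that a longer shared prefix corresponds to a lower common ancestor. The function names and any codes passed to them are illustrative only and are not taken from Bunruigoihyo.

```python
def shared_depth(code_a, code_b):
    """Depth of the lowest common ancestor, i.e. the common prefix length."""
    depth = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        depth += 1
    return depth


def case_neighbors(x_code, supervised_code, unsupervised):
    """N_{x=s,c}: unsupervised fillers located closer to x than the
    supervised neighbor of x for sense s (e_1 in Figure 4.4)."""
    limit = shared_depth(x_code, supervised_code)
    return {name for name, code in unsupervised.items()
            if shared_depth(x_code, code) > limit}


def neighbors(per_case):
    """Equation 4.5: union of the per-case neighbor sets.

    `per_case` holds one (x_code, supervised_code, unsupervised) triple
    for each case c of the sense under consideration."""
    result = set()
    for x_code, supervised_code, unsupervised in per_case:
        result |= case_neighbors(x_code, supervised_code, unsupervised)
    return result
```

Because the supervised neighbor of $x$ is already known from the WSD phase, the only work left is a prefix comparison against each candidate code.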

Second, sample size at each iteration should ideally be one, so as to avoid the supervision of
similar examples. On the other hand, a small sampling size generates a considerable computation
overhead for each iteration of the sampling procedure. This can be a critical problem for
statistics-based approaches, as the reconstruction of statistical classifiers is expensive.
However, similarity-based systems fortunately do not require reconstruction, and examples simply
have to be stored in the database. Furthermore, in each disambiguation phase, our similarity-based
system needs only compute the similarity between each newly stored example and its unsupervised
neighbors, rather than between every example in the database and every unsupervised example.
Let us reconsider Figure 4.4. As mentioned above, when $x$ is stored in the database with sense
$s_1$, only the interpretation score of the $y$'s dominated by $n_4$, i.e. $N_{x=s_1,c}$, will be
changed with respect to sense $s_1$. This algorithm reduces the time complexity of each iteration
from $O(N^2)$ to $O(N)$, given that $N$ is the total number of examples in a given corpus.

4.3 Experimentation 

In order to investigate the effectiveness of our example sampling method, we conducted an
experiment, in which we compared the following four sampling methods:

• a control (random), in which a certain proportion of a given corpus is randomly selected
for training,

• uncertainty sampling (US), in which examples with minimum interpretation certainty are
selected [80],

• committee-based sampling (CBS) [?2],

• our method based on the notion of training utility (TU).



We elaborate on uncertainty sampling and committee-based sampling in Section 4.4. We compared
these sampling methods by evaluating the relation between the number of training examples
sampled and the performance of the system. We conducted six-fold cross validation and
carried out sampling on the training set. With regard to the training/test data set, we used the
same corpus as that used for the experiment described in Section 3.4. Each sampling method
uses examples from IPAL to initialize the system, with the number of example case fillers for
each case being an average of about 3.7. For each sampling method, the system uses the
Bunruigoihyo thesaurus for the similarity computation. In Table 3.2 (in Section 3.4), the column
of "accuracy" for "BGH" denotes the accuracy of the system with the entire set of training
data contained in the database. Each of the four sampling methods achieved this figure at the
conclusion of training.

We evaluated the performance of each system according to its accuracy, that is, the ratio of the
number of correct outputs to the total number of inputs. For the purpose of this
experiment, we set the sample size to 1 for each iteration, $\lambda = 0.5$ for Equation 3.10, and
$k = 1$ for Equation 4.4. Based on a preliminary experiment, increasing the value of $k$ either
did not improve the performance over that for $k = 1$, or lowered the overall performance.
Figure 4.5 shows the relation between the number of the training data sampled and the accuracy of
the system. In Figure 4.5, zero on the x-axis represents the system using only the examples
provided by IPAL.



Looking at Figure 4.5, one can see that compared with random sampling and committee-based
sampling, our sampling method reduced the number of the training data required to achieve
any given accuracy. For example, to achieve an accuracy of 80%, the number of the training
data required for our method was roughly one-third of that for random sampling. Although
the accuracy for our method was surpassed by that for uncertainty sampling for larger sizes of
training data, this minimal difference for larger data sizes is overshadowed by the considerable
performance gain attained by our method for smaller data sizes.

Since IPAL has, in a sense, been manually selectively sampled in an attempt to model
the maximum verb sense coverage, the performance of each method is biased by the initial
contents of the database. To counter this effect, we also conducted an experiment involving
the construction of the database from scratch, without using examples from IPAL. During the
initial phase, the system randomly selected one example for each verb sense from the training
set, and a human expert provided the correct interpretation to initialize the system².
Figure 4.6 shows the performance of the various methods, from which the same general tendency as
seen in Figure 4.5 is observable. However, in this case, our method was generally superior to
other methods. Through these comparative experiments, we can conclude that our example sampling
method is able to decrease the number of the training data, i.e. the overhead for both supervision
and searching, without degrading the system performance.




Figure 4.5: The relation between the number of training data sampled and accuracy of the
system



Figure 4.6: The relation between the number of training data sampled and accuracy of the
system without using examples from IPAL

² In order to minimize the potential bias caused by selection of the initial examples, we
conducted the same trials with three different initial example sets and averaged the results for
each trial.



4.4 Related Work



4.4.1 Uncertainty sampling



The procedure for uncertainty sampling [80] is as follows, where $C(x)$ represents the
interpretation certainty for an example $x$ (see our sampling procedure in Section 4.2.1 for
comparison):

1. $WSD(D, X)$

2. $e \leftarrow \arg\min_{x \in X} C(x)$

3. $D \leftarrow D \cup \{e\}$, $X \leftarrow X \cap \overline{\{e\}}$

4. goto 1
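
The selection step of uncertainty sampling can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names are hypothetical, and `toy_certainty` is a made-up one-dimensional stand-in for $C(x)$ rather than Equation 3.10.

```python
def uncertainty_sample(unsupervised, database, certainty):
    """Step 2: return the example with minimum interpretation certainty C(x)."""
    return min(sorted(unsupervised), key=lambda x: certainty(x, database))


def toy_certainty(x, database):
    """Hypothetical stand-in for C(x): certainty decays with the distance
    between x and the nearest example stored in the database."""
    return max(1.0 / (1.0 + abs(x - e)) for e in database)
```

For instance, `uncertainty_sample([5.0, 5.1, 5.2, 9.0], [0.0], toy_certainty)` returns 9.0, the isolated example farthest from the database, although the examples around 5.1 have more unsupervised neighbors.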

Let us discuss the theoretical difference between this and our method. Considering Figure 4.3
again, one can see that the concept of training utility is supported by the following properties:

(a) an example which neighbors more unsupervised examples is more informative (Figure 4.3-a),

(b) an example less similar to one already existing in the database is more informative
(Figure 4.3-b).

Uncertainty sampling directly addresses property (b), but ignores property (a). It differs from
our method more crucially when more unsupervised examples remain, because these unsupervised
examples have a greater influence on the computation of training utility. This tendency
can also be seen in the comparative experiments in Section 4.3, in which our method
outperformed uncertainty sampling to the highest degree in early stages.



4.4.2 Committee-based sampling 



In committee-based sampling [?2], which follows the "query by committee" principle [132], the
system selects samples based on the degree of disagreement between models randomly taken
from a given training set (these models are called "committee members"). This is achieved by
iteratively repeating the steps given below, in which the number of committee members is given
as two without loss of generality:

1. draw two models randomly,

2. classify unsupervised example $x$ according to each model, producing classifications $C_1$
and $C_2$,

3. if $C_1 \neq C_2$ (the committee members disagree), select $x$ for the training of the system.
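
The three steps above can be sketched in Python for a similarity-based setting. The sketch rests on a simplifying assumption that is ours and not part of the original method: each committee member is a random subset of the database used as a nearest-neighbor classifier. All function names are hypothetical.

```python
import random


def classify(x, member, similarity):
    """Nearest-neighbor classification of x using one committee member,
    given as a list of (example, sense) pairs."""
    _, sense = max(member, key=lambda pair: similarity(x, pair[0]))
    return sense


def disagree(x, member_1, member_2, similarity):
    """Steps 2-3: True when the two members assign different senses to x."""
    return classify(x, member_1, similarity) != classify(x, member_2, similarity)


def committee_select(unsupervised, database, similarity, rng, member_size=1):
    """Step 1 followed by steps 2-3 for every unsupervised example."""
    selected = []
    for x in unsupervised:
        member_1 = rng.sample(database, member_size)  # draw two models randomly
        member_2 = rng.sample(database, member_size)
        if disagree(x, member_1, member_2, similarity):
            selected.append(x)
    return selected
```

Here `database` is a list of (example, sense) pairs and `rng` is a `random.Random` instance, passed explicitly so that runs can be repeated with a fixed seed.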



Figure 4.7 shows a typical disparity evident between committee-based sampling and our
sampling method. The basic notation in this figure is the same as in Figure 3.9, and both $x$ and
$y$ denote unsupervised examples, or more formally $D = \{e_1, e_2\}$ and $X = \{x, y\}$. Assume a
pair of committee members $\{e_1\}$ and $\{e_2\}$ have been selected from the database $D$. In
this case, the committee members disagree as to the interpretations of both $x$ and $y$, and
consequently, both examples can potentially be selected as a sample for the next iteration. In
fact, committee-based sampling tends to require a number of similar examples (similar to $e_1$
and $y$) in the database, otherwise committee members taken from the database will never agree.
This feature provides a salient contrast to our method, for which a similar example is less
informative, and $x$ is therefore preferred to $y$ as a sample. This contrast can also be related
to the fact that committee-based sampling is currently applied to statistics-based language
models (HMM classifiers); in other words, statistical models generally require that the
distribution of the training data reflects that of the overall text. Through this argument, one
can assume that committee-based sampling is better suited to statistics-based systems, while our
method is more suitable for similarity-based systems.



Figure 4.7: A case where either x or y can be selected in committee-based sampling

Engelson and Dagan criticized uncertainty sampling [80], which they call a "single model"
approach, as distinct from their "multiple model" approach:

    . . . sufficient statistics may yield an accurate 0.51 probability estimate for a class c in
    a given example, making it certain that c is the appropriate classification³. However,
    the certainty that c is the correct classification is low, since there is a 0.49 chance
    that c is the wrong class for the example. A single model can be used to estimate
    only the second type of uncertainty, which does not correlate directly with the utility
    of additional training.

We note that this criticism cannot be applied to our sampling method, despite the fact that our
method falls into the category of a single model approach. In our sampling method, given
sufficient statistics, the increment of the certainty degree for unsupervised examples, i.e. the
training utility of additional supervised examples, becomes small (theoretically, for both
similarity-based and statistics-based systems). As such, the utility factor can be considered to
correlate directly with additional training, for our method.

³ By appropriate classification, Engelson and Dagan mean the classification given by a
perfectly-trained model.



4.5 Discussion 



4.5.1 Sense ambiguity of case fillers in selective sampling 

The semantic ambiguity of case fillers (nouns) should be taken into account during selective
sampling. Figure 4.8, which uses the same basic notation as Figure 3.9, illustrates one possible
problem caused by case filler ambiguity. Let $x_1$ be a sense of a case filler $x$, and $y_1$ and
$y_2$ be different senses of a case filler $y$. On the basis of Equation 3.10, the interpretation
certainty of $x$ and $y$ is small in Figures 4.8-a and 4.8-b, respectively. However, in the
situation shown in Figure 4.8-b, since (a) the task of distinguishing between the verb senses 1
and 2 is easier, and (b) instances where the sense ambiguity of case fillers corresponds to
distinct verb senses will be rare, training using either $y_1$ or $y_2$ will be less effective
than using a case filler of the type of $x$. It should also be noted that since Bunruigoihyo
fundamentally associates each word with a single concept, this problem is not critical in our
case. However, given other existing thesauri like the EDR electronic dictionary [58] or
WordNet [93], these two situations should be strictly differentiated.




Figure 4.8-a: Interpretation certainty of x is small because x lies in the intersection of
distinct verb senses

Figure 4.8-b: Interpretation certainty of y is small because y is semantically ambiguous

Figure 4.8: Two separate scenarios in which the interpretation certainty of x is small



4.5.2 Applicability of our selective sampling method 

First, we note that our selective sampling method is expected to be applicable to most
similarity-based methods, or more precisely, those which follow the nearest neighbor method,
although in this chapter, we describe the sampling method being applied only to our verb sense
disambiguation method (from Chapter 3). However, our sampling method does not seem effective
for some similarity-based methods, such as the one proposed by Ng and Lee [105]. Ng and Lee
used the similarity measure as shown in Equation 2.15 (see page 20). It should be noted that
their similarity computation uses the statistics obtained from supervised examples. Selective
sampling for the supervised examples potentially biases the estimation of $P(s|F = f)$, i.e. the
conditional probability that sense $s$ occurs, given that feature $F$ takes value $f$. This issue
needs to be further explored. On the other hand, our sampling method can be applied to
similarity-based methods which use resources independent of the supervised examples, for the
similarity computation [?].

Second, Figure 4.9 exemplifies a limitation of our sampling method. The basic notation
is the same as in Figure 3.9. In this figure, the only supervised examples contained in the
database are $e_1$ and $e_2$, and $x$ represents an unsupervised example belonging to sense 2.
Given this scenario, $x$ is informative because (a) it clearly evidences the semantic vicinity of
sense 2, and (b) without $x$ as sense 2 in the database, the system may misinterpret other
examples neighboring $x$. However, in our current implementation, the training utility of $x$
would be small because it would be mistakenly interpreted as sense 1 with great certainty due to
its relatively close semantic proximity to $e_1$. Even if $x$ has a number of unsupervised
neighbors, the total increment of their interpretation certainty cannot be expected to be large.
This shortcoming often presents itself when the semantic vicinities of different verb senses are
closely aligned or their semantic ranges are not disjoint. Here, let us consider Figure 3.5
again, in which the nominative case would parallel the semantic space shown in Figure 4.9 more
closely than the accusative. Relying more on the similarity in the accusative (the case with
greater CCD) as is done in our system, we aim to map the semantic space in such a way as to
achieve higher semantic disparity and minimize this shortcoming.




Figure 4.9: The case where informative example x is not selected



4.6 Summary

Corpus-based approaches have recently pointed the way to a promising trend in word sense
disambiguation. However, these approaches tend to require a considerable overhead for
supervision in constructing a large-sized database, additionally resulting in a computational
overhead to search the database. To overcome these problems, our method, which is currently
applied to a similarity-based verb sense disambiguation system, selectively samples a
smaller-sized subset from a given example set. This method is expected to be applicable to other
similarity-based systems. Applicability for other types of systems needs to be further explored.

The process basically iterates through two phases: (normal) word sense disambiguation and 
a training phase. During the disambiguation phase, the system is provided with sentences con- 
taining a polysemous verb, and searches the database for the most semantically similar example 
to the input (nearest neighbor method). Thereafter, the verb is disambiguated by superimposing 
the sense of the verb appearing in the supervised example. The similarity between the input 
and an example, or more precisely the similarity between the case fillers included in them, is 
computed based on an existing thesaurus (or the vector space model is alternatively applicable). 
In the training phase, a sample is then selected from the system outputs and provided with 
the correct interpretation by a human expert. Through these two phases, the system iteratively 
accumulates supervised examples into the database. The critical issue in this process is to decide 
which example should be selected as a sample in each iteration. To resolve this problem, we con- 
sidered the following properties: (a) an example which neighbors more unsupervised examples 
is more influential for subsequent training, and therefore more informative, and (b) since our 
verb sense disambiguation is based on nearest neighbor resolution, an example similar to one 
already existing in the database is redundant. Motivated by these properties, we introduced and 
formalized the concept of training utility as the criterion for example selection. Our sampling 
method always gives preference to that example which maximizes training utility. 

We reported on the performance of our sampling method by way of experiments, in which we
compared our method with random sampling, uncertainty sampling [80] and committee-based
sampling [?2]. The result of the experiments showed that our method reduced both the
overhead for supervision and the overhead for searching the database to a larger degree than any
of the above three methods, without degrading the performance of verb sense disambiguation.
Through the experiment and discussion, we claim that uncertainty sampling considers property
(b) mentioned above, but lacks property (a). We also claim that committee-based sampling
differs from our sampling method in terms of its suitability to statistics-based systems as
compared to similarity-based systems.



Chapter 5



Exploration of Similarity
Computation

This chapter explores the computation of word similarity, on which the performance 
of our similarity-based verb sense disambiguation system is highly dependent. The 
statistics-based computation of word similarity has been popular in recent research, 
but is associated with significant computational cost. On the other hand, the use 
of hand-crafted thesauri as semantic resources is simple to implement, but lacks 
mathematical rigor. To integrate the advantages of these two approaches, we aim 
at calculating a statistical weight for each branch of a thesaurus, so that we can 
compute word similarity based simply on the length of the path between two words in 
the thesaurus. Our experiment on Japanese nouns shows that this framework upheld 
the inequality of statistics-based word similarity with an accuracy of more than 70%. 
We also applied our framework of word similarity computation to our verb sense 
disambiguation system. 

5.1 Motivation 

As with other similarity-based systems, the performance of our system (as described in
Chapter 3) is highly dependent on the computation of the (relative) similarity between two
examples, which is further decomposed into the similarity between two case filler nouns (we shall
call this "word similarity", hereafter). By the performance of the system, we refer to the
following two viewpoints:

• the relative number of correct outputs ("accuracy"),

• the cost needed for the similarity computation ("efficiency").

We described two previous approaches for word similarity computation in Section 3.3. Here, let
us summarize these approaches again.

The first approach to word similarity computation is statistics-based [11, 6, 113, 139]. From
a number of different implementations, we used the vector space model (VSM) [T^, 106, 127, 13C]
for our verb sense disambiguation system. In the vector space model, each word is represented
by a vector consisting of co-occurrence statistics, such as relative frequency, with respect to
other words. Note that in our case, we used only the co-occurrence between a noun and verbs.
The similarity between two given words is then computed from the two vectors representing
those words. One typical implementation computes the similarity as the cosine of the angle
between the two vectors, a method which is also commonly used in information retrieval and
text categorization systems to measure the similarity between documents [127]. Since it has a
mathematical rationale, this type of similarity measurement has been popular. In addition,
since the similarity is computed from given co-occurrence data, word similarity can easily be
adjusted according to the domain. However, data sparseness is an inherent problem. This was
observed in our preliminary experiment (see Section 3.4), despite using statistical information
taken from as many as four years of news articles. Furthermore, in this approach, each vector
requires $O(N)$ computational cost, where $N$ is the length of the vector. Note that although
each vector term is confined to a verb in our case, admitting additional parts of speech as
vector terms is problematic. In particular, the vectors can become excessively long when they
include noun terms, because nouns generally constitute an open class.
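As a minimal sketch of the cosine measure described above, the following Python fragment compares nouns through their co-occurrence vectors over verbs. The nouns, the four verb dimensions and all frequency values are invented for illustration and are not taken from our data; returning 0 for an all-zero vector is likewise only a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length co-occurrence vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for a noun with no co-occurrence data
    return dot / (norm_u * norm_v)

# Hypothetical relative co-occurrence frequencies of three nouns with four verbs.
noun_vectors = {
    "beer":  [0.6, 0.3, 0.1, 0.0],
    "wine":  [0.5, 0.4, 0.1, 0.0],
    "train": [0.0, 0.1, 0.2, 0.7],
}

sim_close = cosine_similarity(noun_vectors["beer"], noun_vectors["wine"])
sim_far = cosine_similarity(noun_vectors["beer"], noun_vectors["train"])
```

Each call walks both vectors once, which is the $O(N)$ cost noted above.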

The other category of word similarity approach uses a hand-crafted thesaurus (such as Roget's
thesaurus |jT^ or WordNet [93] in the case of English, and Bunruigoihyo [102] or EDR [^]
in the case of Japanese), based on the intuitively feasible assumption that words located near
each other within the structure of a thesaurus have similar meanings. Therefore, the similarity
between two given words is represented by the length of the path between them in the thesaurus
structure [77, 81, 142]. Unlike the former approach, the required computational cost can be
restricted to constant order, because only a list of semantic codes for each word is required.
For example, the commonly used Japanese Bunruigoihyo thesaurus [102] represents each semantic
code with only 7 digits^1. However, computationally speaking, the relation between the
similarity (namely the semantic length of the path) and the physical length of the path is not
clear^2. Furthermore, since most thesauri aim at a general word hierarchy, the similarity
between words used in specific domains (technical terms) cannot be measured to the desired
level of accuracy.

^1 The revised version of the Bunruigoihyo thesaurus assigns an 8-digit code to each word.

^2 Most researchers heuristically define functions between the similarity and physical path length [77, 81, 142].
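The constant-order cost can be illustrated with a small Python sketch. It assumes, purely for illustration, that every word carries a fixed-length digit code in which each digit selects one branch on the way down from the root; the actual layout of Bunruigoihyo codes may differ, and the function names are hypothetical.

```python
def common_prefix_length(code_a, code_b):
    """Number of leading digits the two semantic codes share."""
    n = 0
    for ch_a, ch_b in zip(code_a, code_b):
        if ch_a != ch_b:
            break
        n += 1
    return n

def path_length(code_a, code_b):
    """Number of branches between two leaves, assuming one tree level per digit."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of digits")
    depth = len(code_a)
    # Climb from one leaf to the deepest shared node, then descend to the other leaf.
    return 2 * (depth - common_prefix_length(code_a, code_b))
```

Because the codes have a fixed number of digits, the loop is bounded by a constant regardless of the size of the vocabulary.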
To sum up, the use of hand-crafted thesauri is preferable to statistics-based approaches in
terms of "efficiency". As for "accuracy", while our experiments in Section 3.4 quantitatively
showed that the Bunruigoihyo thesaurus outperformed the vector space model, there has been no
qualitative explanation for this general tendency. In addition, new and improved models can
potentially be introduced for statistics-based approaches. In view of these arguments, we aim
at integrating the advantages of the two methodological types above, or more precisely,
realizing statistics-based word similarity based on the length of the thesaurus path [^].
Consequently, our approach allows us to measure statistics-based word similarity while
retaining optimal computational cost ($O(1)$). The crucial concern in this process is how to
determine the 'statistics-based length' (SBL) of each branch in a thesaurus. We tentatively use
the Bunruigoihyo thesaurus, in which each word corresponds to a leaf in the tree structure. Let
us take Figure 5.1, which shows a fragment of the thesaurus. In this figure, the $w_i$ denote
words and the $x_i$ denote the statistics-based length of each branch $i$. Let the
statistics-based (vector space model) word similarity between $w_1$ and $w_2$ be
$vsm(w_1, w_2)$. We hope to estimate this similarity by the length of the path through
branches 3 and 4, and derive the equation "$x_3 + x_4 = vsm(w_1, w_2)$". Intuitively speaking,
any combination of $x_3$ and $x_4$ which satisfies this equation can constitute the SBLs for
branches 3 and 4. Formalizing equations for other pairs of words in the same manner, we can
derive the set of simultaneous equations shown in Figure 5.2. That is, we can assign the SBL
for each branch by solving for each $x_i$.



Figure 5.1: A fragment of the thesaurus associated with $x_1$ to $x_6$

$$
\begin{aligned}
x_1 + x_2 + x_3 + x_5 &= vsm(w_1, w_3) \\
x_1 + x_2 + x_3 + x_6 &= vsm(w_1, w_4) \\
x_1 + x_2 + x_4 + x_5 &= vsm(w_2, w_3)
\end{aligned}
$$

Figure 5.2: A fragment of the set of simultaneous equations associated with Figure 5.1



In Section 5.2, we elaborate on the methodology of our word similarity measurement. We
then evaluate our method by way of an experiment in Section 5.3 and apply this method to the
task of word sense disambiguation in Section 5.4.



5.2 Methodology 
5.2.1 Overview 

Our word similarity measurement proceeds in the following way: 

1. compute the statistics-based similarity of every combination of given words,

2. set up a set of simultaneous equations through use of the thesaurus and the previously
computed word similarities, and find solutions for the statistics-based length (SBL) of each
corresponding thesaurus branch (see Figures 5.1 and 5.2),

3. measure the similarity between two given words by the sum of the SBLs included in the
path between those words.

For step 1, we used the vector space model, for which the reader is referred to Section 3.3.1.
However, note that our framework is independent of the implementation of the similarity
computation. We elaborate on steps 2 and 3 in the following sections.

5.2.2 Resolution of the simultaneous equations 



The set of simultaneous equations used in our method is expressed by Equation 5.1, where $A$
is a matrix comprising only the values 0 and 1, and $B$ is the list of $vsm$'s (i.e.
statistics-based similarities) for all possible combinations of given words. $X$ is a list of
variables, which represents the statistics-based length (SBL) for the corresponding branch in
the thesaurus.

$$AX = B \tag{5.1}$$
Here, let the $i$-th similarity in $B$ be $vsm(a, b)$, and let $path(a, b)$ denote the path
between words $a$ and $b$ in the thesaurus. Each equation contained in the set of simultaneous
equations is represented by Equation 5.2, where $x_j$ is the statistics-based length (SBL) for
branch $j$, and $a_{ij}$ is either 0 or 1, as in Equation 5.3.

$$
\begin{pmatrix} a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_j \\ \vdots \end{pmatrix}
= vsm(a, b) \tag{5.2}
$$

$$
a_{ij} =
\begin{cases}
1 & \text{if } j \in path(a, b) \\
0 & \text{otherwise}
\end{cases} \tag{5.3}
$$
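The construction of $A$ and $B$ from Equations 5.2 and 5.3 can be sketched as follows. The four-word, six-branch tree is a hypothetical stand-in for a thesaurus fragment, the similarity values are placeholders, and the names `ROOT_TO_LEAF`, `path` and `build_equations` are introduced only for this illustration.

```python
from itertools import combinations

# Hypothetical tree: the branches passed on the way from the root to each word.
ROOT_TO_LEAF = {
    "w1": [1, 3],
    "w2": [1, 4],
    "w3": [2, 5],
    "w4": [2, 6],
}
NUM_BRANCHES = 6

def path(a, b):
    """Branches between words a and b: those on exactly one of the two routes."""
    return set(ROOT_TO_LEAF[a]) ^ set(ROOT_TO_LEAF[b])

def build_equations(vsm):
    """Return (A, B) with one 0/1 row of A and one similarity in B per word pair."""
    A, B = [], []
    for a, b in combinations(sorted(ROOT_TO_LEAF), 2):
        branches = path(a, b)
        # Equation 5.3: the entry is 1 when branch j lies on path(a, b), else 0.
        A.append([1 if j in branches else 0 for j in range(1, NUM_BRANCHES + 1)])
        B.append(vsm[(a, b)])
    return A, B

# Placeholder similarities; in practice these come from the vector space model.
toy_vsm = {pair: 1.0 for pair in combinations(sorted(ROOT_TO_LEAF), 2)}
A, B = build_equations(toy_vsm)
```

With four words the sketch produces six equations over six branch variables; for a real thesaurus the number of rows grows quadratically with the number of words, while the number of columns stays at the number of branches.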



By finding the solutions for $X$, we can assign SBLs to branches. However, the similarity
values far outnumber the variables. For example, the revised version of the Bunruigoihyo
thesaurus contains about 55,000 noun entries, and therefore the number of similarity values
for those nouns is about $1.5 \times 10^9$ (${}_{55000}C_2$). On the other hand, the number of
branches is only about 53,000. As such, a great many equations are redundant, and the time
complexity of solving the simultaneous equations becomes a crucial problem. To counter this
problem, we randomly divide the overall equation set into equal parts, each of which can be
solved in reasonable time. Thereafter we approximate the solution for $x$ by averaging the
solutions for $x$ derived from each subset. Let us take Figure 5.3, in which the number of
subsets is given as two without loss of generality. In this figure, $x_{i1}$ and $x_{i2}$
denote the answers for branch $i$ individually derived from subsets 1 and 2, and $x_i$ is
approximated by the average of $x_{i1}$ and $x_{i2}$ (that is, $\frac{x_{i1} + x_{i2}}{2}$).
To generalize this notion, let $x_{ij}$ denote the solution associated with branch $i$ in
subset $j$. The approximate solution for branch $i$ is given by Equation 5.4, where $n$ is the
number of divisions of the equation set.

$$x_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij} \tag{5.4}$$

With regard to resolving the simultaneous equations, we used the mathematical analysis tool
"MATLAB"^3.

^3 Developed by Cybernet System, Inc.
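The divide-and-average scheme of Equation 5.4 can be sketched in Python as below. This is not the MATLAB procedure we actually ran: the solver here is a plain gradient-descent least-squares routine chosen only to keep the example self-contained, and the toy matrix, the values in `true_x` and all function names are assumptions made for illustration.

```python
import random

def least_squares(A, B, iterations=2000):
    """Approximate a least-squares solution of A x = B by plain gradient
    descent on the squared residual, starting from x = 0."""
    n_vars = len(A[0])
    # 1 / (sum of squared entries) is a step size small enough to converge.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_vars
    for _ in range(iterations):
        residuals = [b - sum(a * xj for a, xj in zip(row, x)) for row, b in zip(A, B)]
        for j in range(n_vars):
            x[j] += step * sum(row[j] * r for row, r in zip(A, residuals))
    return x

def solve_by_averaging(A, B, n_subsets, seed=0):
    """Randomly split the equations into n_subsets parts, solve each part,
    and average the per-subset solutions (Equation 5.4)."""
    indices = list(range(len(A)))
    random.Random(seed).shuffle(indices)
    subsets = [indices[k::n_subsets] for k in range(n_subsets)]
    solutions = [least_squares([A[i] for i in s], [B[i] for i in s]) for s in subsets]
    n_vars = len(A[0])
    return [sum(sol[j] for sol in solutions) / n_subsets for j in range(n_vars)]

# Toy system: the six word pairs of a hypothetical four-word, six-branch tree.
A = [
    [0, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
]
true_x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # invented branch lengths
B = [sum(a * t for a, t in zip(row, true_x)) for row in A]

x_full = solve_by_averaging(A, B, n_subsets=1)  # one subset: the whole equation set
x_avg = solve_by_averaging(A, B, n_subsets=2)   # two subsets, as in Figure 5.3
```

In this toy system branches 1 and 2 always occur together, so only their sum is determined and `x_full` need not equal `true_x` even though it satisfies every equation. The averaged solution `x_avg` is an approximation and is not guaranteed to satisfy the full set exactly.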
[Diagram: the equation set divided into subset 1 and subset 2]

Figure 5.3: Approximation of the statistics-based length $x_i$



5.2.3 Word similarity using SBL 



Let us reconsider Figure 5.1. In this figure, the similarity between $w_1$ and $w_2$, for
example, is measured by the sum of $x_3$ and $x_4$. In general, the similarity between words
$a$ and $b$ using SBL ($sbl(a, b)$, hereafter) is realized by Equation 5.5, where $x_i$ is the
SBL for branch $i$, and $path(a, b)$ is the path that includes the thesaurus branches located
between $a$ and $b$.

$$sbl(a, b) = \sum_{i \in path(a, b)} x_i \tag{5.5}$$
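Equation 5.5 amounts to a short walk over the thesaurus, as the following Python sketch suggests. The tree shape and the SBL values are invented for illustration; in practice the lengths would be the solved $x_i$.

```python
# Hypothetical tree: the branches passed on the way from the root to each word.
ROOT_TO_LEAF = {"w1": [1, 3], "w2": [1, 4], "w3": [2, 5], "w4": [2, 6]}

# Invented statistics-based lengths, one per branch.
SBL = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6}

def path(a, b):
    """Branches between a and b: drop the branches both root-to-leaf routes share."""
    route_a, route_b = ROOT_TO_LEAF[a], ROOT_TO_LEAF[b]
    shared = 0
    while shared < min(len(route_a), len(route_b)) and route_a[shared] == route_b[shared]:
        shared += 1
    return route_a[shared:] + route_b[shared:]

def sbl(a, b):
    """Equation 5.5: the sum of the SBLs of the branches on path(a, b)."""
    return sum(SBL[i] for i in path(a, b))
```

Here `sbl("w1", "w2")` adds the lengths of branches 3 and 4 only, while `sbl("w1", "w3")` also includes branches 1 and 2. The work is bounded by the depth of the tree, not by the length of any co-occurrence vector.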



5.3 Experimentation 

Here, we evaluate the degree to which the simultaneous equations are successfully approximated
through the use of the technique described in Section 5.2. In other words, we analyze to what
extent the (original) statistics-based word similarity can be realized by our framework. We
conducted this evaluation in the following way. Let the statistics-based similarity between
words $a$ and $b$ be $vsm(a, b)$, and the similarity based on SBL be $sbl(a, b)$. Here, let us
assume the inequality "$vsm(a, b) > vsm(c, d)$" for words $a$, $b$, $c$ and $d$. If this
inequality is maintained by our method, that is, "$sbl(a, b) > sbl(c, d)$", the similarity
measurement is taken to be successful. The accuracy is then estimated as the ratio of the
number of successful measurements to the total number of trials. Since resolution of the
equations is time-consuming, we tentatively generalized 23,223 nouns into 303 semantic classes
(represented by the first 4 digits of the semantic code given in the Bunruigoihyo thesaurus),
reducing the total number of equations to 45,753. Figure 5.4 shows the relation between the
number of equations used and the accuracy: we divided the overall equation set into $n$ equal
subsets^4 (see Section 5.2.2), and progressively increased the number of subsets used in the
computation. When the whole set of equations was provided, the accuracy was about 72%. We
also estimated the lower bound for this evaluation; that is, we conducted the same trials
using the Bunruigoihyo thesaurus. In this case, if word $a$ is located more closely to $b$
than $c$ is to $d$ and "$vsm(a, b) > vsm(c, d)$", that trial is taken to be successful. We
found that the lower bound method produced an accuracy of about 56%, and therefore that our
framework outperformed this method.

Figure 5.4: The relation between the number of equations used and the accuracy
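The success criterion used in this evaluation can be sketched as follows. The three word pairs and their scores are invented, and skipping pairs with equal $vsm$ values is an assumption of this sketch, since the treatment of ties is not specified above.

```python
from itertools import combinations

def order_agreement(vsm, sbl):
    """Fraction of trials in which a strict vsm inequality is preserved by sbl.

    vsm and sbl map the same word pairs to their similarity scores.
    """
    pairs = sorted(vsm)
    successes = trials = 0
    for p, q in combinations(pairs, 2):
        if vsm[p] == vsm[q]:
            continue  # assumed: ties give no strict inequality to test
        higher, lower = (p, q) if vsm[p] > vsm[q] else (q, p)
        trials += 1
        if sbl[higher] > sbl[lower]:
            successes += 1
    return successes / trials if trials else 0.0

# Invented scores for three word pairs.
toy_vsm = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.2}
toy_sbl = {("a", "b"): 0.8, ("a", "c"): 0.1, ("b", "c"): 0.3}
toy_accuracy = order_agreement(toy_vsm, toy_sbl)
```

In the toy data two of the three inequalities are preserved, so the function reports two successes out of three trials.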



5.4 Application to verb sense disambiguation system 

We further evaluated our word similarity technique in the task of verb sense disambiguation.
The evaluation methodology is simple: the performance of the word similarity measurement is
evaluated in terms of the improvement it brings to our similarity-based system. In practice,
Equation 5.5 is used as a substitute for sim in the similarity computation of our system (see
Section 3.3 for details). As in Section 3.4, this experiment involved six-fold cross
validation; that is, we divided the training/test data into six equal parts, and conducted six
trials in which a different part was used as test data each time, and the rest as the
database. We evaluated the performance of the system according to its accuracy, that is, the
ratio of the number of correct outputs to the total number of inputs. As training/test data we
used the same corpus as in Section 3.4, in which each sentence contains one of eleven
polysemous verbs, producing a total of 1315 sentences (see Table ^^). We found that the
combined performance with SBLs did not differ from that for the Bunruigoihyo thesaurus (see
the column "BGH" in Table 3.2), which means that our method still finds it difficult to
improve on existing hand-crafted thesauri. However, it should be noted that, given the
inferiority of the vector space model shown in Table 3.2, our word similarity measure cannot
theoretically exceed the performance of the Bunruigoihyo thesaurus, because the objective of
SBL is to simulate word similarity based on the vector space model. The use of SBL is expected
to improve other applications where the vector space model outperforms hand-crafted thesauri.

^4 We arbitrarily set n = 15 so as to be able to resolve the equations in reasonable time.
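The six-fold protocol can be sketched in Python as below. How the corpus was actually partitioned is not described here, so the interleaved split is only an assumption, and the integer list stands in for the 1315 sentences.

```python
def six_fold_splits(examples, n_folds=6):
    """Yield (database, test) pairs; each fold serves as test data exactly once."""
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [examples[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        database = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        yield database, test

corpus = list(range(1315))  # stand-in for the sense-annotated sentences
splits = list(six_fold_splits(corpus))
```

Since 1315 is not divisible by six, the parts can only be approximately equal: one fold holds 220 items and the others 219.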



5.5 Related Work 

In this chapter, we focused solely on the "efficiency" problem related to the word similarity
computation, disregarding the "accuracy" problem. Resnik [122] (later enhanced by Jiang and
Conrath [^]) integrates a thesaurus taxonomy and information content for similarity
computation, focusing on the accuracy problem. Resnik's method computes the similarity between
given word classes (synsets) taken from WordNet [33], based on the degree of commonality
between them. In other words, the more information two synsets share, the more similar they
are. Formally speaking, the similarity between two synsets $c_1$ and $c_2$ is computed by
Equation 5.6.

$$sim(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\log P(c)] \tag{5.6}$$

Here, $S(c_1, c_2)$ is the set of synsets that dominate both $c_1$ and $c_2$. $P(c)$ is
estimated based on the distribution of words associated with synset $c$, obtained from a
corpus. Resnik [122] (and Jiang and Conrath ||5^) showed the effectiveness of this method
using 30 pairs of words, which seems a relatively small data size. We applied Resnik's method
to our verb sense disambiguation system, using the Bunruigoihyo as a core thesaurus. We
conducted the same procedure as in Section 3.4, and found that the accuracy of verb sense
disambiguation was about 23%. It should be noted that this result represents only one example
among the numerous uses of word similarity, and does not discount the validity of Resnik's
method. However, we claim that improving the accuracy of similarity computation still remains
difficult, and needs to be further explored.
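Equation 5.6 can be illustrated with a toy taxonomy in Python. The class names and probabilities below are invented, and treating a class as dominating itself is an assumption of this sketch rather than something stated above.

```python
import math

# Hypothetical taxonomy (child -> parent) and invented class probabilities P(c).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}
PROB = {"entity": 1.0, "animal": 0.4, "vehicle": 0.3, "dog": 0.2, "cat": 0.1, "car": 0.25}

def dominators(c):
    """The class c together with every class above it (assumed to include c itself)."""
    result = set()
    while c is not None:
        result.add(c)
        c = PARENT[c]
    return result

def resnik_similarity(c1, c2):
    """Equation 5.6: the largest -log P(c) over classes dominating both c1 and c2."""
    shared = dominators(c1) & dominators(c2)
    return max(-math.log(PROB[c]) for c in shared)
```

In the toy taxonomy, "dog" and "cat" share the class "animal" and so score $-\log 0.4$, whereas "dog" and "car" share only the root, whose probability of 1 gives a similarity of 0.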

5.6 Summary 

The previous methods for word similarity computation can be divided into two approaches. 
Statistics-based approaches have been popular because of their mathematical rigor. However, 
these approaches generally require prohibitive computational cost. The remaining approaches,
which use a hand-crafted thesaurus, are computationally cheaper. However, the quality of these
approaches is highly dependent on a priori human judgement. In this chapter, we proposed a
new method to integrate the two different approaches for word similarity computation ||3^. That
is, we realized statistics-based word similarity computation with complexity equivalent to that 
for thesaurus-based approaches. Our objective is to determine the statistical weight for each 
thesaurus branch, so that one can measure statistics-based similarity by simply traversing the 
thesaurus structure. For this purpose, we set up a set of simultaneous equations, in which each 
variable and value correspond to a thesaurus branch and statistics-based similarity, respectively. 
By resolving the simultaneous equations, we can expect to assign appropriate statistical weights 
to each branch. By way of this method, we can expect to optimize the computational efficiency 
required for our verb sense disambiguation. 
Chapter 6

Conclusion 



6.1 Contribution 

In this research, we targeted various problems associated with corpus-based word sense
disambiguation, focusing on verbal polysemy. Let us summarize the main points that have been
made in this research.

• First, in Chapter 3, we described our similarity-based verb sense disambiguation system
and demonstrated its effectiveness. One of the major features of this chapter was that we
modeled and computationally implemented the linguistic behavior of 'case contribution to
disambiguation' (CCD). Through comparative experiments, we confirmed that the performance of
our system was improved by considering the CCD factor. Our experiments also showed that a
hand-crafted thesaurus (in our case, the Bunruigoihyo thesaurus [102]) is an effective
resource for word sense disambiguation. Due to the lack of a large-sized sense-annotated test
collection for Japanese, we used a relatively small-sized corpus for our experimentation.
However, our experimental results are expected to reflect general performance tendencies to a
certain degree. We further enhanced our system by way of computation of the degree of
interpretation certainty, so as to obtain more reliable outputs. In addition, we proposed a
prototype method of propagating contextual constraints when a unique polysemous verb appears
in an input sentence. It should be noted that while our experiments were carried out on
Japanese, our proposed system can be applied to other languages.

• Although the performance of our similarity-based system proved to be relatively
satisfactory, we identified the overhead for manual sense annotation (overhead for
supervision) and the overhead for searching a database (overhead for search) as being
limitations on operational applications. Given this observation, in Chapter 4, we proposed a
novel sampling method, which minimizes the overhead for supervision and the overhead for
search. Note that these overheads can be major drawbacks for corpus-based NLP approaches in
general. Our sampling method is characterized by its reliance on the notion of 'training
utility': the degree to which each example is informative for future sampling when
incorporated into the database. Comparative experiments showed that our sampling method
reduced both overheads to a larger degree than any existing sampling method, without degrading
the performance of verb sense disambiguation. It should be noted that example sets sampled by
our method can also be used as initial training data to further improve the quality of
unsupervised learning, as occurs in bootstrapping.

• Finally, in Chapter 5, we proposed a new method for word similarity computation, given
that the performance of our system is highly dependent on the similarity computation between
given words. Our method integrates the advantages of the two different approaches for
similarity computation: the mathematical rigor of statistics-based approaches and the easy
implementation of thesaurus-based approaches. The core process is to determine the
'statistics-based length' (SBL) of each branch in a thesaurus, reflecting statistically
computed word similarities. In this way, our method realizes statistics-based word similarity
by simply traversing the thesaurus structure.



6.2 Outstanding Issues 

One outstanding issue is the problem of balancing qualitative and quantitative approaches
[H, 69, 142] (see Section 2.2.1 for a description of these approaches). Among a number of past
research attempts to tackle this issue, Uramoto [142], for example, used selectional
restrictions and scalable similarity based on a thesaurus taxonomy, as (relatively shallow)
constraints and preference conditions, respectively. However, given that Uramoto did not
compare his integrated approach with monotonic approaches, the empirical study of balancing
different approaches requires further exploration and is expected to improve on current word
sense disambiguation methods.

Another direction would be the establishment of standardized evaluation criteria for word
sense disambiguation systems, which is expected to streamline system comparison/enhancement^1.
However, as discussed earlier, numerous problems exist associated with evaluation
methodologies. Ultimately speaking, we think that task-based (application-based) evaluation is
the most straightforward, because the performance of a system is directly related to benefits
to real-world applications. In addition, word sense distinction, which can be one major
obstacle in standardized evaluation, can be fixed relatively easily for specific applications.
On the other hand, numerous sense-annotated corpora are required for different applications.
Our sampling method (as well as past efforts targeting unsupervised methods) is expected to
reduce the human overhead for establishing these types of sense-annotated corpora.

^1 ACL SIGLEX is going to hold a workshop (co-ordinated by Adam Kilgarriff) in September 1998, to initiate the
standardized evaluation of word sense disambiguation systems for English, French, German, Italian and Spanish.